ABSTRACT The amount of data generated by the current interconnected world is immeasurable, and a
large part of such data is publicly available, which means that it is accessible by any user, at any time, 
from anywhere in the Internet. In this respect, Open Source Intelligence (OSINT) is a type of intelligence 
that actually benefits from that open nature by collecting, processing and correlating points of the whole
cyberspace to generate knowledge. In fact, recent advances in technology are causing OSINT to currently 
evolve at a dizzying rate, providing innovative data-driven and AI-powered applications for politics, economy
or society, but also offering new lines of action against cyberthreats and cybercrime. The paper at hand 
describes the current state of OSINT and makes a comprehensive review of the paradigm, focusing on the 
services and techniques enhancing the cybersecurity field. On the one hand, we analyze the strong points of 
this methodology and propose numerous ways to apply it to cybersecurity. On the other hand, we cover the 
limitations when adopting it. Considering there is a lot left to explore in this ample field, we also enumerate 
some open challenges to be addressed in the future. Additionally, we study the role of OSINT in the public 
sphere of governments, which constitute an ideal landscape to exploit open data. 


INDEX TERMS OSINT, cyberintelligence, cybersecurity, cyberdefence, challenges, national security, 
computer crime, computational intelligence, knowledge acquisition, social network services, software tools, 
data privacy, Internet. 


The associate editor coordinating the review of this manuscript and
approving it for publication was Luis Javier Garcia Villalba.


I. INTRODUCTION

Open Source Intelligence (OSINT) consists in the collection,
processing and correlation of public information from open
data sources such as the mass media, social networks, forums
and blogs, public government data, publications, or commercial
data. Given some input data, together with the application
of advanced collection and analysis techniques, OSINT
continuously expands the knowledge about the target. In this
way, the information found feeds the gathering process again
to get closer to the final goal [1].

Nowadays, OSINT is widely adopted by governments and
intelligence services to conduct their investigations and fight
against cybercrime [2]. Nevertheless, it is not only utilised
for state affairs, but rather applied to several different goals.
Indeed, current research is focused on (but not limited to)
three main applications which are represented in FIGURE 1
and are described next:

• Social opinion and sentiment analysis: Along with the
boom of online social networks, it is possible to collect
users' interactions, messages, interests and preferences
to extract non-explicit knowledge. The evidence accumulated
from social media is far-reaching and widely
advantageous [3]. Such collection and analysis could be
applied, for instance, to marketing, political campaigns
or disaster management [4].

• Cybercrime and organized crime: The open data is
continuously analyzed and matched by OSINT processes
in order to spot criminal intentions at an early
stage. Taking into account adversaries' patterns and
relationships between felonies, OSINT is able to provide
security forces with an opportunity to promptly detect
illegal actions [5]. In this direction, by exploiting the
open data, it would be possible to track the activity of
terrorist organizations, which are increasingly active on the
Internet [6], [7].

• Cybersecurity and cyberdefence: ICT (Information and
Communication Technology) systems are continuously
attacked by criminals aiming at disrupting the availability
of the provided services [8]. Research becomes hence
crucial to defend those systems from cyberattackers,
concretely by facing the challenges that are still open
in the field of cybersecurity [9]. In this sense, data
sciences are not only being applied to the footprinting
in pentestings, but also to the preventive protection of
organizations and companies. Concretely, data mining
techniques may help by performing analysis of daily
attacks, correlating them and supporting decision making
processes for an effective defense, but also for a
prompt reaction [10]. In the same way, OSINT can be
also considered in this context as a source of information
for tracebacks and investigations. Forensic digital
analysis [11] can incorporate OSINT to complement the
digital evidence left by an incident.

FIGURE 1. OSINT principal use cases.

In addition to those, OSINT can be applied to other con- 
texts. In particular, one may extract relevant information by 
performing social engineering attacks. Ill-motivated entities 
leverage publicly-available information released online (e.g., 
on social networks) in order to create appealing hooks to 
capture the target [12]. Moreover, it is possible to perform 
automatic veracity assessment on the open data aiming at 
disclosing fake news and deepfakes, among others [13]. 

Nonetheless, it is important to note that the use of
public data also raises sensitive issues. On the one hand,
the EU General Data Protection Regulation (GDPR) limits
the processing of personal data related to individuals in the 
EU zone [14]. On the other hand, there is a strong ethical 
component which is linked to the users’ privacy. In particular, 
the profiling of people [15] could reveal personal details such
as their political preference, sexual orientation or religious 
beliefs, amongst others. Additionally, the exploitation of such 
a vast amount of information may lead to abuse, resulting
in harming innocents through cyberbullying, cybergossip or 
cyberaggressions [16]. 

The paper at hand, which is an extension of the work pro- 
posed in [17], encompasses the present and future of OSINT 
by analyzing its positive and negative points, describing ways 
of applying this type of intelligence, and enunciating future 
directions for the evolution of this paradigm. In addition, 
a more detailed description of different techniques, tools 
and open challenges is presented in this work. Furthermore, 
we propose the integration of OSINT within the DML (Detec- 
tion Maturity Level) model to address the attribution problem 
from a different perspective in the context of cyberattacks 
investigations. We also introduce sample workflows to facil- 
itate the understanding and use of OSINT to gather valuable 
information starting from basic inputs. 

In addition, our purpose is to stimulate research and
advances in the OSINT ecosystem. The scope of such an ecosystem
is quite wide, spanning from psychology and social science
to counterintelligence and marketing. As we have seen so far,
OSINT is a promising mechanism that concretely improves 
the traditional cyberintelligence, cyberdefence and digital 
forensic fields [18]. The impact that this methodology could 
have on society thanks to current technology and the large 
number of open sources is still unexploited. There is still 
a long way ahead to explore in this topic, and this article 
presents some future appealing research lines. 

The remainder of this paper is organized as follows. 
SECTION II offers a review of recent research works in 
the field of OSINT. SECTION III discusses the motivation, 
pros and cons of the development of OSINT. SECTION IV 
explains the principal OSINT steps and practical workflows 
to carry them out. Then, SECTION V includes an in-depth 
description of OSINT-based collection techniques and ser- 
vices. SECTION VI analyzes and compares some OSINT 
tools that automatize the OSINT collection and analysis 
of information. SECTION VII proposes the integration of 
OSINT in the investigation of cyberattacks. SECTION VIII 
focuses on the impact of OSINT within a nation, not only for 
the sake of its internal cyberdefence operations, but also as 
a beneficiary of transparency policies. Spain is specifically 
taken as a reference for affinity and contextualized with the 
rest of the world. SECTION IX poses some open challenges 
regarding research in OSINT. Finally, SECTION X concludes 
with some key remarks, as well as future research directions. 


II. STATE OF THE ART

In recent years, with the advances of big data and data 
mining techniques, the research community has noticed that 
open data represents a powerful source for analyzing social
behaviors and obtaining relevant information [19]. Next we 
describe some remarkable works pivoting around each of the 
three aforementioned principal use cases for OSINT. 
With regards to the use of OSINT for extracting social
opinion and emotions, Santarcangelo et al. [20] proposed a 
model for determining user opinions about a given keyword 
through social networks, specifically studying the adjectives, 
intensifiers and negations used in tweets. Unfortunately, it is 
a simple keyword-based solution designed only for the Italian
language, not taking into account semantic issues. On the
other hand, Kandias et al. [21] could relate people's usage
of social networks (in particular, Facebook) to their stress 
level. However, the experiments were carried out only with 
405 users, while nowadays there is a chance of processing 
much larger amounts of data. Another interesting study is 
conducted in [22], where authors applied Natural Language 
Processing (NLP) to WhatsApp messages in order to possibly 
prevent the occurrence of mass violence in South Africa. 
Unfortunately, the investigation is limited to text messages, 
thus excluding vital information which can be disclosed 
through multimedia material. 

In the context of cybercrime and organized crime, there 
are several works that explore the application of OSINT 
for criminal investigations [23]. For example, OSINT could 
increase the accuracy of prosecutions and arrests of cul- 
prits with frameworks like the one proposed by Quick and 
Choo [11]. Concretely, authors apply OSINT to digital foren- 
sic data of a variety of devices to enhance the criminal 
intelligence analysis. In this field, another opportunity that 
OSINT yields is the detection of illegal actions as well as the 
prevention of future crimes such as terrorist attacks, murders 
or rapes. In fact, the European projects ePOOLICE [24] and 
CAPER [25] were designed to develop effective models for 
scanning open data automatically in order to analyze the 
society and detect emerging organized crime. In contrast to 
the previously mentioned projects, whose proposals were not
practically used in real cases, Delavallade et al. [26] describe 
a model based on social networks data that is able to extract 
future crime indicators. Such model is then applied to the 
copper theft and to the jihadist propaganda use cases. 

From the point of view of cybersecurity and cyberde- 
fence, OSINT represents a valuable tool for improv- 
ing our protection mechanisms against cyberattacks. 
Hernandez et al. [27] propose the use of OSINT in the 
Colombian context to prevent attacks and to allow strategic 
anticipation. It includes not only plugins for collecting
information, but also machine learning models to perform
sentiment analysis. Moreover, the DiSIEM European project [28]
maintains as a first goal the integration of diverse OSINT 
data sources in current SIEM (Security Information and Event 
Management) systems to help react to recently discovered
vulnerabilities in the infrastructure or even predict
possible emerging threats. In addition, Lee and Shon [29] also
designed an OSINT-based framework to inspect cyberse- 
curity threats of critical infrastructure networks. However, 
none of these approaches has been applied to real-world
scenarios, so their effectiveness remains questionable.

Extending the discussion to other application fields,
the authors of [30] demonstrate how to passively collect
significant information on organizational employees in an
automated fashion. Such information is then related to 
the analysis of the so-called social engineering attack 
surface, showing the effective feasibility of the proposed 
approach. Then, the authors propose a set of potential coun- 
termeasures, including a publicly-available social engineer- 
ing vulnerability scanner which companies may leverage in 
order to reduce the exposure of their employees. 
Furthermore, a systematic review of approaches, methodologies
and tools proposed by academia to
conduct automatic veracity assessment of publicly-available 
data is performed in [31]. Specifically, the authors studied 
107 research items between 2013 and 2017 to argue on the 
state-of-the-art of veracity assessment, which has become a 
great concern during the last decade due to the spread of 
fake news and deepfakes. In this direction, the authors out- 
line the relative immaturity of this field, identifying several 
challenges which will characterize future research trends. 


III. OSINT ADVANTAGES AND SHORTCOMINGS

The fields of application of OSINT are numerous and the 
solutions being developed under this paradigm are increasing. 
However, behind this methodology there is a trade-off that 
developers and engineers have to deal with. From a technical 
point of view, as we can see in TABLE 1, OSINT exposes a 
number of benefits, but it has to deal with some restrictions 
too, which are detailed next. 


A. OSINT BENEFITS 

1) HUGE AMOUNT OF AVAILABLE INFORMATION 

There is currently a large volume of worthwhile open source 
data to be analyzed, correlated and linked [32]. This includes 
social networks, public government documents and reports, 
online multimedia content, newspapers and even the Deep 
web and the Dark web [33], among others. Actually, both the 
Deep Web and the Dark Web (the latter circumscribed within 
the former) contain even more information than the Surface 
Web (i.e., the Internet known by most users) [34]. In order to 
be able to access these networks, it is necessary to use specific 
tools since their contents are not indexed by traditional search 
engines. 

Unlike the Surface Web and most of the Deep Web, 
the Dark Web offers anonymity and privacy to users who 
utilize it. This property makes it easier for criminals to use
this network to browse, search and publish for
illegitimate purposes while hiding their identity. Therefore,
the Dark Web is an ideal source to apply OSINT and fight 
against cybercrime, organized crime or cyberthreats. On the 
other hand, the pursuit and de-anonymization of these people
remain non-trivial challenges that OSINT must overcome to work
properly [35].


2) HIGH COMPUTING CAPACITY 
Advances in computer architecture, processors and GPUs
(graphics processing units) make it possible to carry out
labor-intensive operations in terms of collection, processing,
analysis and storage [36]. Thanks to this feature, we have the
opportunity to apply OSINT considering large amounts of public
information and mixing a high number of data sets, relationships
and patterns from different types of open sources, while applying
advanced processing and analysis techniques.

TABLE 1. OSINT pros and cons in a nutshell.

Pros ✓                                  Cons ✗
Huge amount of available information    Complexity of data management
High capacity of computing              Unstructured information
Big data and machine learning           Misinformation
Complementary types of data             Data sources reliability
Flexible purpose and wide scope         Strong ethical/legal considerations


3) BIG DATA AND MACHINE LEARNING 

Data analysis and data mining techniques, as well as machine
learning algorithms, are proliferating and can automate
investigation and decision-making processes, making them more
intelligent and efficient [36]. They allow spotting complex
correlations that are naturally unpredictable to humans. This
point will be key in future OSINT activities, as it will mark
the difference between human-driven and artificial
intelligence-led research. By incorporating those techniques,
the process of collection and analysis will definitely improve,
thus resulting in accurate investigations close to our goal.
Additionally, government counterintelligence agencies
can leverage such paradigm to further enhance the quality 
of managed information and, consequently, the battle against 
terrorist organizations [37]. 


4) COMPLEMENTARY TYPES OF DATA 

OSINT can be fed with other types of information [38].
The inherent structure of the system is open enough
to include data that has not actually been obtained from 
open sources. This fact means that OSINT can be even more 
effective if we are able to add external pieces of information to 
complement investigations. For example, Law Enforcement 
Agencies could take advantage of citizens collaboration to 
feed OSINT searches, intelligence services could leverage 
classified information about cybercriminals or incidents to 
enrich OSINT investigations, or even common users could 
combine OSINT with social engineering to profile their 
target. 


5) FLEXIBLE PURPOSE AND WIDE SCOPE 

Due to the nature of OSINT, investigations can be extended 
to many problems and can collect pieces of information all
over the cyberspace. This paradigm could be used for eco- 
nomic, psychological, strategic, journalistic, labor or security 
aspects, among others. In particular, we could highlight the 
benefits in the field of crime and cybersecurity, where OSINT 
could monitor suspicious people or dangerous groups, detect 
influencing profiles related to radicalization, study worrying 
trends of the society, support the attribution of cyberattacks
and crimes, enhance digital forensic analysis, etc. [5], [18]. 


B. OSINT LIMITATIONS 

1) COMPLEXITY OF DATA MANAGEMENT 

The quantity of data is huge and, consequently, it is challeng- 
ing to handle it efficiently and effectively [39]. It is beneficial 
for OSINT to consider as much information as possible, but 
also to have advanced techniques and significant resources to 
ensure high quality collection, processing and analysis. 


2) UNSTRUCTURED INFORMATION 

The public information available on the Internet is inherently 
massively disorganized. This means that the data collected
by OSINT is so heterogeneous that it is difficult to classify,
link and examine such data in order to extract relevant
relationships and knowledge [4]. In this sense, OSINT requires
mechanisms such as data mining, Natural Language Process- 
ing (NLP), or text analytics to homogenize the unstructured 
information in order to be able to exploit it. 


3) MISINFORMATION 

Social networks and communication media are flooded with 
subjective opinions, fake news and canards [4]. For this rea- 
son, the existence of inaccurate information has to be taken 
into account in the implementation of OSINT mechanisms 
and should not drive the propagation of the search. OSINT 
activities should always deal with reliable information and 
follow trusted exploration lines to ensure positive and 
convincing outcomes [40]. 


4) DATA SOURCES RELIABILITY 

The trustworthiness and authority of the information are 
indeed the key for successful OSINT investigations [41]. 
Ideally, the collected data should come from authoritative, 
reviewed and trusted sources (official documents, scientific 
reports, reliable communication media) [39]. In practice, 
OSINT will also coexist with subjective or non-authoritative 
sources, such as the content of social networks or manipulated 
media [42]. Even though this type of sources is more prone 
to misinformation, it is actually where more knowledge can 
be extracted to investigate people, groups or companies. If 
the credibility of the open sources of information represents 
indeed a limitation, it becomes even more challenging con- 
sidering the possible ambiguity of users’ queries to retrieve 
the desired information [43]. 
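
One simple way to account for source reliability is to weight each claim by the trust assigned to the sources backing it. The following is a minimal sketch under stated assumptions: the source names, the trust scores and the helper functions are hypothetical and chosen only for illustration. It is not a method proposed in the cited works.

```python
from collections import defaultdict

# Hypothetical trust scores in [0, 1]. Real values would come from the
# investigator's own assessment of each source.
SOURCE_TRUST = {
    "official_gazette": 0.95,
    "national_newspaper": 0.80,
    "anonymous_forum": 0.30,
    "social_network_post": 0.25,
}

def weighted_claim_scores(observations, default_trust=0.1):
    """Sum the trust of every source backing each claimed value.

    observations: iterable of (source_name, claimed_value) pairs.
    Unknown sources receive default_trust.
    """
    scores = defaultdict(float)
    for source, value in observations:
        scores[value] += SOURCE_TRUST.get(source, default_trust)
    return dict(scores)

def most_credible(observations):
    """Return the claimed value with the highest accumulated trust."""
    scores = weighted_claim_scores(observations)
    return max(scores, key=scores.get)

# Two low-trust sources disagree with one authoritative source.
observations = [
    ("social_network_post", "Madrid"),
    ("anonymous_forum", "Madrid"),
    ("official_gazette", "Murcia"),
]
print(most_credible(observations))
```

In this toy input the single authoritative source outweighs the two non-authoritative ones, which illustrates why a plain majority vote may be misleading when sources differ in reliability.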
5) STRONG ETHICAL/LEGAL CONSIDERATIONS

Numerous concerns about privacy, respect and personal 
integrity emerge with the development of OSINT [44]. In 
this direction, it has to be noted that the question of whether 
OSINT constitutes an ethical issue is generally situated 
within the area of the ethics of intelligence collection [45]. 
On the one hand, although it works on publicly accessible data,
OSINT has the power to disclose information that is not
explicitly posted on the web. Uncovered results should respect
users' privacy and not reveal intimate and personal issues [15],
while taking into account current related regulations (such as
GDPR [14]).
To this extent, aspects such as sexual orientation, religious 
beliefs, political inclination or compromising behaviours can 
be inferred from the Internet, and this disclosure process can 
be problematic in many countries today. On the other hand, 
the scope of OSINT-based searches should be, by definition, 
limited to open data sources. Under no circumstances can access
controls or authentication methods be bypassed to extract
knowledge.


IV. OSINT WORKFLOWS 

OSINT, like any other type of intelligence, has a well-defined 
and precise methodology. From our scientific-technical point 
of view, we are particularly interested in three steps. 

Firstly, in the collection phase, publicly available data is 
retrieved from relevant open sources according to the target 
or objective. In particular, the Internet is the resource par 
excellence due to the volume of existing material and easy 
accessibility. The collection process is particularly relevant 
because from this stage onwards the whole process of intelli- 
gence generation is triggered. 

Then, in the analysis phase, the collected raw material is 
treated to generate valuable and comprehensible information. 
The data by itself is not useful, so it has to be interpreted to 
obtain the first facts derived from an in-depth analysis. 

Finally, in the knowledge extraction process, the infor- 
mation purified previously is taken as input for more sophis- 
ticated inference algorithms. Thanks to the computational 
advances of the current era, it is possible to detect patterns, profile
behaviours, predict values or correlate events. 

It is worth mentioning that the second and third steps com- 
prise technologies widely used and known in the context of 
data mining. However, the OSINT collection approach differs 
from current data-driven services. Nowadays, common data 
analysis applications gather as much information as possible 
from pre-defined data sources and implement clear gathering 
processes. On the contrary, OSINT solutions should collect 
specific facts from the sea of all possible and reachable open 
resources. 

In order to face this latter challenging uncertainty and 
go one step further, we propose in FIGURE 2 a practical 
framework to carry out OSINT-based investigations. We have 
included those exploration paths which are worthwhile to fol- 
low for optimizing the analysis of collection results and max- 
imizing the extraction of knowledge. This high abstraction 
scheme includes the clearest transactions, representative
elements and outstanding operations. 


A. OSINT COLLECTION 

Before the analysis and intelligence extraction steps, 
the investigator has to expand the dataset about the target. 
With this aim, we propose some OSINT techniques to rep- 
resent different collection strategies. In particular, we have 
considered search engines, social networks, email address, 
username, real name, location, IP address and domain 
name OSINT techniques (as we will further describe in 
SECTION V). Under each one, there will be innumerable 
OSINT services with similar ways of collecting data. 

In this phase, it is assumed that, at least, an atomic piece of 
data about the target is available (e.g., real name, username, 
email address, etc.). From that initial seed and according to 
its nature, the investigator applies the most suitable OSINT 
techniques to derive more data. In this sense, the results 
obtained with a specific technique are a data transfer to be 
used by another type of technique. These represented transac- 
tions illustrate possible ways of propagating the investigation, 
where the output of the technique of origin becomes the input 
to feed the technique of destination. 
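
The propagation described above, where the output of one technique becomes the input of another, can be sketched as a breadth-first expansion from the initial seed. This is only an illustrative sketch: the three technique functions, their fixed lookup tables and the `max_items` limit are hypothetical placeholders, whereas real techniques would query external OSINT services.

```python
from collections import deque

# Hypothetical stand-ins for OSINT techniques. Each one takes a value and
# returns newly derived (data_type, value) pairs.
def email_technique(value):
    user, _, domain = value.partition("@")
    return [("username", user), ("domain", domain)]

def username_technique(value):
    known_profiles = {"jdoe": [("real_name", "John Doe")]}
    return known_profiles.get(value, [])

def domain_technique(value):
    known_hosts = {"example.org": [("ip_address", "192.0.2.10")]}
    return known_hosts.get(value, [])

# Which technique accepts which type of input.
TECHNIQUES = {
    "email": email_technique,
    "username": username_technique,
    "domain": domain_technique,
}

def propagate(seed_type, seed_value, max_items=100):
    """Expand the dataset breadth-first, starting from one atomic seed."""
    found = {(seed_type, seed_value)}
    queue = deque([(seed_type, seed_value)])
    # Stop taking new items from the queue once max_items is reached.
    while queue and len(found) < max_items:
        data_type, value = queue.popleft()
        technique = TECHNIQUES.get(data_type)
        if technique is None:
            continue  # no technique accepts this type of input
        for item in technique(value):
            if item not in found:
                found.add(item)
                queue.append(item)  # output feeds the next technique
    return found

results = propagate("email", "jdoe@example.org")
for data_type, value in sorted(results):
    print(f"{data_type}: {value}")
```

Keeping a set of already seen items prevents the investigation from looping when two techniques derive each other's inputs.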


B. OSINT ANALYSIS 

The results of the continuous iterations through the different
OSINT techniques should be analyzed and understood to generate
valuable information. There is an increasing number of analysis
techniques in the literature for this task [46]. Below we
highlight the procedures that are applicable in our scenario:

• Lexical analysis: Raw data should be examined to extract
entities and relations from text. It is essential to apply
translation processes to the language used in the OSINT
investigation [47] and to filter out of the sentences any noise
that does not add value.

• Semantic analysis: Having a bag of words is not useful
if the meaning is not extracted [48]. To understand the data,
natural language processing algorithms are used nowadays [49].
In addition, sentiment analysis techniques permit the
contextualization of subjective posts or opinions to classify
the emotional status of the author (e.g., positive, negative or
neutral). Finally, truth discovery procedures address the
challenging task of resolving conflicts in multi-source data
that holds opposing positions on the same subject [50].

• Geospatial analysis: Data collected from social networks,
events, sensors or IP addresses is worth analyzing
from a location-based perspective. In this
sense, the usage of maps or graphs facilitates the rep- 
resentation and comprehension of data [51], as well as 
extracting meaningful connections between incidents or 
persons. 

• Social media analysis: The features brought by modern
social media allow researchers to carry out in-depth
analysis of users [52]. In such a scenario, the analysis of
social data allows the creation of a network of contacts,
interactions, places, behaviours and tastes around the
subject.
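
As an illustration of the lexical analysis step listed above, the following sketch extracts a few entity types from raw text with regular expressions. The patterns, the function name and the sample sentence are assumptions made for this example. They are deliberately simple and would need hardening before real use.

```python
import re

# Deliberately simple patterns for illustration. They do not validate
# IPv4 octet ranges or cover every legal email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# A handle is an "@name" that is not part of an email address.
HANDLE_RE = re.compile(r"(?<![\w@.])@(\w{2,30})")

def extract_entities(text):
    """Return the entities found in raw text, grouped by type."""
    return {
        "email_address": sorted(set(EMAIL_RE.findall(text))),
        "ip_address": sorted(set(IPV4_RE.findall(text))),
        "username": sorted(set(HANDLE_RE.findall(text))),
    }

sample = (
    "Contact jdoe@example.org or @jdoe_sec. "
    "The server at 192.0.2.10 was scanned."
)
entities = extract_entities(sample)
for entity_type, values in entities.items():
    print(entity_type, values)
```

Each extracted entity can then serve as a new seed for the collection techniques, which is how analysis results feed back into the investigation.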


The results of launching the aforementioned techniques are
considered as output info and are categorized into three main
groups:

• The personal information fuses the person identity details
which are mainly obtained from the real name, email address,
username, social networks and search engines techniques.

• The organizational information is formed by aspects of a
team or company composed of individuals. It is essentially
collected by means of social networks, search engines,
location, domain name and IP address techniques.

• The network information covers technical data of systems
and communication topologies which is usually achieved through
location, domain name and IP address techniques.

Logically, these three blocks of information can be expanded
with more elements. Moreover, a single investigation may have
different types of output info that complement each other.
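
The relation between techniques and output info groups stated above can be written down as a small lookup table. The sketch below is an assumption-laden illustration: the identifiers and the grouping function are hypothetical, and since the text says each group is only "mainly" fed by these techniques, a real investigation would also inspect the content of each finding.

```python
# Mapping taken from the three groups described above.
TECHNIQUE_CATEGORIES = {
    "real_name": {"personal"},
    "email_address": {"personal"},
    "username": {"personal"},
    "social_networks": {"personal", "organizational"},
    "search_engines": {"personal", "organizational"},
    "location": {"organizational", "network"},
    "domain_name": {"organizational", "network"},
    "ip_address": {"organizational", "network"},
}

def group_output_info(findings):
    """Group (technique, value) findings into the three output info blocks."""
    grouped = {"personal": [], "organizational": [], "network": []}
    for technique, value in findings:
        # Findings from unknown techniques are left ungrouped.
        for category in sorted(TECHNIQUE_CATEGORIES.get(technique, ())):
            grouped[category].append(value)
    return grouped

findings = [("username", "jdoe"), ("domain_name", "example.org")]
grouped = group_output_info(findings)
print(grouped)
```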


C. OSINT KNOWLEDGE EXTRACTION 


The value of the information collected so far is unquestionable.
However, it is the extraction of intelligence from those
findings that actually provides an insightful picture of the
target [53]. To this end, we consider knowledge elicitation as
the treatment of the analysis results (output info) making use
of data mining and artificial intelligence techniques. In the
following we mention some promising technologies at this stage:
• Correlation: Detection of relationships between people,
events or pieces of data in general [54]. Strongly related
features are especially valuable to reveal those non-explicit
associations existing in the dataset.

• Classification: The data can be divided in groups according
to predefined categories (supervised learning) [55]. This
technique permits the organization of large amounts of
information for more effective knowledge extraction [56].

• Outlier detection: This procedure analyzes the dataset
and detects anomalies in it [57]. They are particularly
interesting for the observation of malignant agents,
whose behaviour or actions differ from the general
population.

• Clustering: It assigns pieces of data into clusters, being
able to consider a large number of conditions or heuristics
[58]. This could reveal, for example, different ways of
behaving in the network, various types of online profiles or
different forms of attacking individuals, organizations or
infrastructures [59] without knowing the existence of that
diversity beforehand (unsupervised learning).

• Regression: The main objective of this technique is to
forecast or predict numeric values or facts [60]. For example,
a linear regression returns a value according to a linear
function, a neural network is a structure that maps complex
combinations of inputs to an output, and a deep learning model
is made up of several layers that combine and operate on the
input.

• Tracking patterns: Differing from anomaly detection, pattern
recognition is a process for detecting regularities in data
[61]. The methods mentioned above can be included in this
broad knowledge-discovery concept. In fact, any artificial
intelligence technique is suitable for open data knowledge
extraction.
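
Two of the techniques listed above, correlation and outlier detection, can be illustrated with a few lines of standard-library code. This is a minimal sketch: the Pearson coefficient and the z-score rule are common textbook choices picked here for illustration, not the specific methods of the cited works, and the sample figures are invented.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - my) ** 2 for y in ys) ** 0.5
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (norm_x * norm_y)

def zscore_outliers(values, threshold=2.0):
    """Indices of values lying more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Invented daily post counts of one profile: day 6 stands out.
posts_per_day = [3, 4, 2, 5, 3, 4, 40, 3]
print(zscore_outliers(posts_per_day))

# Invented activity series of two accounts that move together.
account_a = [1, 2, 3, 4]
account_b = [2, 4, 6, 8]
print(pearson(account_a, account_b))
```

A single extreme value inflates the standard deviation, so the z-score rule is only a rough first filter. More robust statistics or learned models would be preferable on real data.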
These intelligent techniques allow inferring abstract, complex
and valuable facts about the target that are not explicitly
published on the Internet [62]. However, this process
poses several challenges, mainly residing in researching and 
developing this knowledge extraction process to identify, 
profile or monitor criminals, recognize and explore mali- 
cious organizations or uncover and attribute cybernetic inci- 
dents. In addition, several privacy considerations arise due 
to the powerful inferences that are potentially achievable. 
The extracted knowledge about a person, company or organization
may be especially sensitive and its manipulation
indirectly leads to ethical and legal problems (specifically 
addressed in SUBSECTION IX-F). Indeed, we should never 
lose sight of the fact that these techniques could be even 
misused to directly harm people or groups (deeper analysis 
in SUBSECTION IX-G). 


V. OSINT COLLECTION TECHNIQUES AND SERVICES 

As has been shown, OSINT is quite promising and powerful,
but its implementation is also challenging. In fact,
the first consideration is that it requires data as a starting
point. Fortunately, the volume of raw data is not a problem
nowadays due to the existence of the Internet. In addition, 
there is also an increasing number of applications, known in 
this context as OSINT services, that precisely facilitate the 
gathering on the web. 

In the following, a summary of the most common OSINT 
techniques is presented. Within each technique, the most 
outstanding associated OSINT services at the time of writing 
are shown, giving hints on how to effectively exploit their 
potentialities. It is worth mentioning that OSINT services are
ephemeral and their number can grow or shrink over time. On the contrary,
the OSINT technique is a broader concept that will endure 
over time. 


A. SEARCH ENGINES 
Google, Bing and Yahoo, among other search engines, are well known and
widely used tools. Using them in the traditional way is the simplest
form of OSINT. Given a textual query, these engines search the World
Wide Web for information that matches the input, and they generally
return valuable results to the user. Nevertheless, the number of results
can be so overwhelming that it becomes counterproductive. For that
reason, a good investigator should know how to formulate requests
according to the desired outcome. Services like Google or Bing support
filters to refine searches¹ and retrieve exactly the type of information
we are interested in. For instance, double quotes ("") force exact
matches, OR and AND act as logical operators, and * acts as a wildcard.
It is also possible to introduce conditions like filetype to specify a
certain file type, site to limit results to those from a specific
website, or intitle to find pages with certain keywords in their title.
TABLE 2 contains some operators that can be used to refine Google and
Bing searches.

1 https://support.google.com/websearch/answer/2466433
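
As an illustration, the operators of TABLE 2 can be combined
programmatically before a query is submitted. The following Python
sketch is not tied to any search engine API: the function name
build_query and its parameters are hypothetical, and it only assembles
the query string that an investigator would type into Google or Bing.

```python
def build_query(terms, exact=None, exclude=(), site=None, filetype=None, intitle=None):
    """Assemble a search string from some of the operators listed in TABLE 2.

    Hypothetical helper for illustration; it does not contact any search engine.
    """
    parts = list(terms)
    if exact:
        parts.append(f'"{exact}"')                 # exact-match phrase
    parts.extend(f"-{word}" for word in exclude)   # excluded terms
    if site:
        parts.append(f"site:{site}")               # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")       # restrict to one file type
    if intitle:
        parts.append(f"intitle:{intitle}")         # keyword in the page title
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["university", "murcia"],
                      exclude=["catholic"], site="um.es", filetype="pdf"))
```

Keeping the query construction in one place makes it easier to repeat
the same refined search across several targets or domains.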

Yahoo, in turn, does not permit specific filters, but we can restrict
the date, language or country of the results. The case of the
DuckDuckGo search engine is especially interesting because it does not
track the user, nor does it record the IP address or the search
history. This privacy-preserving approach makes the findings
homogeneous for all users, regardless of habits, preferences, location,
or search history.

Moreover, some search engines have been designed for specific
territories. Yandex is well known in Russia and Eastern Europe, and
implements search operators² to restrict the search by URL, file type,
language, date, and so on. Baidu is another search service widely used
in Asia. It includes not only the typical keyword search bar, but also
additional resources valuable for OSINT such as a social network, a
questions-and-answers section, a virtual library or an encyclopedia,
among others. There are also search engines for the Arabic community
such as Yamli or Eiktub, but they are much less used. These services
are particularly interesting in investigations about people, groups and
companies belonging to specific communities.

Finally, it is essential to know specific search engines for browsing
the Dark Web. OSINT investigations against drug trafficking, child
pornography, weapon sales or terrorism benefit greatly from exploring
these less popular resources. To this end, Ahmia and Torch are search
engines available within the Tor anonymity network [63]. However, the
researcher will have to deal with the anonymity of this network and its
sites.


B. SOCIAL NETWORKS 
Nowadays, the exposure of the daily life of individuals and
organizations in social networks is evident. Any curious person has
realized that a lot of personal information can be found on these
platforms with no advanced knowledge. As shown in TABLE 3, these
applications offer precise search possibilities in the context of
OSINT. Next we describe some of the best known and most used social
networks worldwide.

Facebook is a social network spread all over the world with millions of
users. It could be considered a diary of society, where one can find
very valuable personal information for OSINT investigations. The
profile of our target can reveal his/her employment, education, age,
location, visited places or liked groups, among others. The photos and
publications may also help us contextualize the company or person we
are investigating, the areas he/she frequents or the type of activities
he/she carries out. In addition, it is also possible to search by
location when the real name is not known, which may ultimately lead to
the profile of our target.

2 https://yandex.com/support/search/query-language/search-operators.html

TABLE 2. Some Google/Bing filters for advanced search.

Google/Bing filter                               Search operator  Example of use
Force an exact-match search                      ""               "University of Murcia"
Exclude a term or phrase                         -                university murcia -catholic
Search for X or Y                                OR, |            university murcia|cartagena
Search for X and Y (used by default)             AND              university AND of AND murcia
Use of a wildcard                                *                university of *
Search for a range of numbers                    ..               university murcia 2010..2019
Group terms or search operators                  ()               "university of (murcia|cartagena)"
Search within a given domain                     site:            university murcia site:um.es
Search for a certain file type                   filetype:        university murcia filetype:pdf
Search in page titles                            intitle:         university intitle:umu
Search in URLs                                   inurl:           university inurl:um
Search in the text of the pages                  intext:          university intext:murcia
Search the most recent cached version of a page  cache:           cache:um.es

TABLE 3. Potential of various social networks.

Social Network  Type                Scope                 Main potential for OSINT
4chan           Online community    Worldwide             Users interested in illicit activities
Badoo           Dating              Worldwide             Intimate and personal details
Cloob           Social connections  Iran                  Personal profile, posting and community membership
Draugiem        Social connections  Latvia                Personal profile, publications in blogs, group membership
Facebook        Social connections  Worldwide             Personal profile, preferences and places visited
Facenama        Social connections  Iran                  Personal profile, publications, photos and videos
Flickr          Photo-sharing       Worldwide             Activities, hobbies, places and personal relationships
Instagram       Social connections  Worldwide             Habits, locations and personal relationships
LinkedIn        Business            Worldwide             Professional profile, education, skills and languages
Mixi            Social connections  Japan                 Personal profile, interests and opinions
Odnoklassniki   Social connections  Mainly Russia         Personal profile of adults, past and present friendships
Qzone           Social connections  Mainly China          Personal profile, preferences, habits
Reddit          Online community    Worldwide             User trends, behaviors, and publications
Renren          Social connections  Mainly China          Personal profile of students, friendships and discussions
Taringa!        Social connections  Mainly Latin America  Personal profile, publications and community membership
Tinder          Dating              Worldwide             Intimate and personal details
Tumblr          Photo-sharing       Worldwide             Activities, hobbies, places and personal relationships
Twitter         Social connections  Worldwide             Personal profile, opinions and publications
VKontakte (VK)  Social connections  Mainly Russia         Personal profile, preferences and publications
Weibo           Social connections  Mainly China          Personal profile, opinions and publications
YouTube         Video-sharing       Worldwide             Video content, opinions and comments of subscribers

YouTube is a video-based platform where large communities form around
shared interests. Not only is the content uploaded by a specific user
valuable (themes, images, scenes, places, and people appearing in
videos), but so are the opinions and comments of subscribers.

Twitter is mainly used for live communication, where it is common to
find personal publications in an ordered timeline. Apart from the
personal information revealed by the profile, it is particularly
interesting to extract the opinions expressed in published tweets, the
relationships with followed and follower users, or the likes on certain
publications. From this type of interaction, an OSINT investigator can
infer the orientation of the target on certain issues, the interests
and preferences of an organization, or how dangerous a person might be.
Additionally, a user-friendly interface³ is available for searching the
whole platform by keywords, exact phrases, hashtags, language, date and
so on. We can even restrict searches to specific users, mentions or
replies.

Instagram is also widespread in modern society as a means of sharing
photos. The places, people and activities shown in pictures can also
assist us in profiling our target. Location is a quite sensitive piece
of data that is frequently shared on this platform. In this sense, we
can also mention more specific photo-sharing services like Tumblr or
Flickr.

3 twitter.com/search-advanced

LinkedIn is the most popular site in the context of business-related
social networking. It permits searching by real name, company,
organization, title or location. In this case, the professional
profiles can reveal full contact data, including email addresses and
mobile phone numbers. In addition, we can also extract information
about employment, education, skills, languages and business
relationships.

It is also worth considering dating websites, which people use to find
a partner. Unlike other social networks, where many users restrict
their personal details, more intimate aspects are usually revealed
here. For this reason, services like Tinder or Badoo are useful for
investigating the background, character, interests, preferences or
behaviour of the target.

Finally, it is possible to browse online communities, which are very
similar to social networks. The posts and topics of these forums
generate interesting interactions to be analyzed by OSINT [64]. Reddit
and 4chan are big communities which host countless threads of
discussion and opinion where really personal and private information
about the target can be identified. However, users of these websites
are commonly anonymous. Additionally, it is not rare to find illicit
content involving bullying, pornography or threats.

On the other hand, some social networks are typically used within
specific regions. The following services are especially important in
some countries.

Qzone, Weibo and Renren are some of the most used social networks in
China. The first one is a very customizable platform where users
publish blogs, diaries, photos or music which reveal details about the
person. The second one has similar features to Twitter, but also
includes polls, file sharing and stories (temporary photo and video
sharing). The last one is widespread among college students. OSINT
investigations whose target is a Chinese person can benefit greatly
from these sites.

There are also social networks that connect Russian and Eastern
European citizens. In this regard, VKontakte, also known as VK, is very
popular. The functionalities, and even the appearance, are quite
similar to Facebook. Users are able to stay in touch with friends,
participate in online communities, post messages, photos, and videos in
private or public pages, and even share files. Another Russian site to
highlight is Odnoklassniki, mainly used by adults. In fact, the main
purpose of its users is to have an online profile, keep in touch with
real-life friends and search for former classmates or past friends. In
this sense, OSINT can be conducted to discover people-to-people
connections from the past to the present.

In Japan, Mixi is a very common social networking site. Apart from the
typical functionalities, we could highlight the possibility to review
products, create personal blogs within the platform, participate in
communities or manage music preferences and listening habits.
For Spanish-speaking countries, especially in Latin America, Taringa!
is a well known social platform for sharing photos, videos and news
with friends. In addition, users are able to create communities, play
online games or share music.

Finally, due to the existing censorship of external services, the most
popular local social networks in Iran are Facenama and Cloob. The first
is mainly used for sharing posts, photos and videos, whereas the second
includes community discussions, photo sharing, posting or chat rooms.
Local platforms are also popular elsewhere: in Latvia, Draugiem is
widely used to share content and communicate online.


C. EMAIL ADDRESS TECHNIQUE 

Searching by a person’s real name can be frustrating because many
people share the same name, so it is sometimes worth starting from an
email address, which is unique and yields much better results at a
faster pace. As shown in TABLE 4, some interesting OSINT services take
an email address as input.

First of all, Hunter can be used to determine whether an email address
is valid or not. Then, Have I Been Pwned informs whether a given email
address is contained in public breaches (and has therefore been
compromised at some point). In particular, the investigator can browse
the list of sites where the email address was compromised. These sites
are potential sources for finding public information about the owner.
Another worthwhile service is Pipl, which works really well to find
information about the owner of an email address such as the real name,
usernames, address, telephone number, education, professional career,
etc.


D. USERNAME TECHNIQUE 

The nicknames used for online services are also a good 
way to collect information regarding a person, as shown in 
TABLE 5. Visiting these services will allow an investigator 
to automatically check a username in several websites at the 
same time to identify more sources of information. 

The services KnowEm, Name Chk, Name Checkr, or 
User Search verify the presence of a given username on the 
most popular social networks and domains. 

Name Vine, in turn, provides an interesting feature that helps when
trying to guess an exact username. Specifically, it suggests profiles
on the top ten social networks which partially match the given
username. This real-time solution offers a fast verification of
username variants (for instance, changing the final number of the
nickname) instead of repeatedly launching time-consuming queries with
other services.

TABLE 4. Utility of the OSINT services belonging to the email address technique.

Email address OSINT service  URL                 Main output
Hunter                       hunter.io           Validity and availability
Have I Been Pwned            haveibeenpwned.com  Appearance in public data breaches
Pipl                         pipl.com            Personal information about the owner

TABLE 5. Utility of the OSINT services belonging to the username technique.

Username OSINT service  URL             Main output
KnowEm                  knowem.com      Presence in social networks, domains and online communities
Name Chk                namechk.com     Presence in social networks, domains and online communities
Name Checkr             namecheckr.com  Presence in social networks, domains and online communities
User Search             usersearch.org  Presence in social networks, domains and online communities
Name Vine               namevine.com    Suggestions of alternative similar usernames
Lullar                  com.lullar.com  Availability in social networks

TABLE 6. Utility of the OSINT services belonging to the real name technique.

Real name OSINT service  URL                   Main output
Pipl                     pipl.com              Personal information
That’s Them              thatsthem.com         Personal details, education, professional career, skills, locations, and relatives
Spokeo                   spokeo.com            Personal details, education, professional career, skills, locations, and relatives
Fast People Search       fastpeoplesearch.com  Personal details, education, professional career, skills, locations, and relatives
Nuwber                   nuwber.com            Personal details, education, professional career, skills, locations, and relatives
Cubib                    cubib.com             Personal details, education, professional career, skills, locations, and relatives
Peek You                 peekyou.com           Personal details, education, professional career, skills, locations, and relatives
Yasni                    yasni.com             Social network profiles
Family Search            familysearch.org      Kinship information, relatives
GENi                     geni.com              Kinship information, relatives
Family Tree Now          familytreenow.com     Kinship information, relatives
True People Search       truepeoplesearch.com  Kinship information, relatives

The website Lullar uses a different approach. It automatically
generates URLs to visit the profile of the username in different social
networks without checking whether they exist. If a link works, then the
profile exists on that social network, whereas a broken link means the
opposite. In addition to speeding up manual checking, the most useful
application would be to explore possible usernames when the one we have
is questionable or partial. When the initial URL fails, similar or
alternative users are often listed by the social networks, which can be
used to identify the complete existing username.
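
The idea of generating candidate profile URLs, together with the
variation of a trailing number mentioned for Name Vine, can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions: the URL patterns in PROFILE_PATTERNS and the helper names
are chosen for the example, they do not reproduce how Lullar or Name
Vine work internally, and no request is sent to check whether a profile
exists.

```python
import re

# Illustrative URL patterns (assumptions for this sketch, not an exhaustive list).
PROFILE_PATTERNS = {
    "twitter": "https://twitter.com/{u}",
    "instagram": "https://www.instagram.com/{u}/",
    "reddit": "https://www.reddit.com/user/{u}",
    "github": "https://github.com/{u}",
}


def username_variants(username, span=2):
    """Return the username plus variants whose trailing number is shifted by up to `span`."""
    match = re.fullmatch(r"(.*?)(\d+)", username)
    if not match:
        return [username]          # no trailing number: nothing to vary
    stem, digits = match.groups()
    number = int(digits)
    return [
        f"{stem}{candidate:0{len(digits)}d}"   # keep the original zero padding
        for candidate in range(max(0, number - span), number + span + 1)
    ]


def candidate_urls(username, span=2):
    """Map every username variant to the profile URLs it would have on each site."""
    return {
        variant: {site: pattern.format(u=variant)
                  for site, pattern in PROFILE_PATTERNS.items()}
        for variant in username_variants(username, span)
    }


if __name__ == "__main__":
    for variant, urls in candidate_urls("jdoe07", span=1).items():
        print(variant, urls["github"])
```

The generated links still have to be visited, manually or otherwise, to
learn whether each profile exists.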


E. REAL NAME TECHNIQUE 

Searching for the real name of a target can also yield good results, as
shown in TABLE 6. Apart from social networks, particular services are
capable of revealing home addresses, telephone numbers, email accounts
and usernames, among others.

We could highlight Pipl as the website that returns the most
information given a first and last name. Since the same real name may
produce multiple results, it is possible to refine the search by
including additional aspects of the person such as email, phone,
country, state, city, username or age.

That’s Them also offers a remarkable output containing phone number,
email address, residence, associated IP address, economic situation,
education, occupation or language. Another well-known service is
Spokeo, whose free version only shows full name, gender, age, previous
cities and states of residency, and relatives. More detailed
information about the target requires a premium subscription, which is
out of our scope. Similar services would be Fast People Search, Nuwber,
Cubib or Peek You.
The aforementioned services work correctly for the United States, but
if we want to apply OSINT to a target that lives in another country,
the use of Yasni is more appropriate. However, the results obtained are
only links related to social networks, addresses and personal contacts,
education, and miscellaneous information.

Genealogy services like Family Search, Family Tree Now, 
GENi, or True People Search cover another point of view in 
searches by providing kinship information. Discovering the 
family links of our target broadens the amount of information 
we can unveil, in this case indirectly. 


F. LOCATION TECHNIQUE 

Researching the locations that our target frequents can give us
indications of his/her habits and context. It is also interesting to
know the geographic location of a company or the place where an event
occurred. In this sense, images, addresses and GPS coordinates are
worthwhile data to obtain. TABLE 7 shows some services which are
specifically designed for these purposes.

Google Maps, Wikimapia or Bing Maps are well known sites to find out
locations from GPS coordinates. Conversely, the GPS Coordinates service
returns the coordinates of a given location name.
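
When coordinates are recovered in degrees, minutes and seconds, they
usually have to be converted to decimal degrees before being pasted
into a map service. The following Python sketch shows this conversion
and builds a Google Maps search link; the helper names are hypothetical,
and the link format is assumed to follow the publicly documented Google
Maps URLs scheme (api=1 with a query parameter).

```python
from urllib.parse import urlencode


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -value if hemisphere.upper() in ("S", "W") else value


def maps_search_url(latitude, longitude):
    """Build a map search link for a pair of decimal coordinates."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be within [-90, 90]")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be within [-180, 180]")
    query = urlencode({"api": 1, "query": f"{latitude},{longitude}"})
    return f"https://www.google.com/maps/search/?{query}"


if __name__ == "__main__":
    print(maps_search_url(37.9922, -1.1307))
```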
TABLE 7. Utility of the OSINT services belonging to the location technique.

Location OSINT service  URL                  Main output
Google Maps             google.com/maps      Locations from GPS coordinates
Wikimapia               wikimapia.org        Locations from GPS coordinates
Bing Maps               bing.com/maps        Locations from GPS coordinates
GPS Coordinates         gps-coordinates.net  GPS coordinates from location
Historic Aerials        historicaerials.com  Historic images of the past
Terra Servers           terraserver.com      Historic images of the past
Land Viewer             eos.com              Historic images of the past

TABLE 8. Utility of the OSINT services belonging to the IP address technique.

IP address OSINT service  URL                              Main output
IP Location               iplocation.net                   Location, domain and ISP
ViewDNS                   viewdns.info                     Technical network-based information
That’s Them               thatsthem.com/reverse-ip-lookup  Individual or company information
I Know What You Download  iknowwhatyoudownload.com         Torrent files


Note that the images offered by the aforementioned services are
continuously updated. However, we could be interested in retrieving old
images of past situations. Historic Aerials, Terra Servers or Land
Viewer incorporate historic imagery functionalities precisely to
discover past and outdated views of locations.


G. IP ADDRESS TECHNIQUE 

IP addresses are obtained from cyberattack investigations, 
email addresses or connections over the Internet. They are 
also crucial for digital forensic analysis in order to col- 
lect as much information as possible from an incident. 
TABLE 8 summarizes some services which facilitate these 
tasks. 

The service IP Location obtains, from a given IP address, high-level
aspects such as location (latitude and longitude), country, region,
city, domain name or ISP (Internet Service Provider). If we are
interested in specific facts, the website ViewDNS provides more
technical information apart from the IP location. In particular, it
includes services for displaying registration information about the
associated domain name, showing additional domains hosted on the IP
address, discovering common ports that may be open and the services
running on them, or tracing the network path from ViewDNS to the target
IP address to analyze the associated networks, routers, and servers.

Nevertheless, the previous resources provide data that is not sensitive
or personal in nature. In contrast, That’s Them does offer interesting
information about people, home addresses, companies, or email addresses
related to the given IP address.

Another powerful service providing personal information is I Know What
You Download. This service monitors online torrents and discloses the
files associated with any collected IP addresses. The files downloaded
by our target could reveal really sensitive information about his/her
behaviour or interests.
H. DOMAIN NAME TECHNIQUE

Web pages are a typical point of interest in OSINT investigations. They
can reveal interesting information about our target, whether we are
dealing with a person or a company. It is worth noting that the
majority of the techniques explained for IP addresses are also suitable
in this context. In addition to them, we can highlight some other
services as presented in TABLE 9.

DNS Trails extracts DNS records, but also identifies the number of
additional domains that are related to the encountered results. To this
extent, it is a very helpful way to find relationships and connections.
Whoisology also shows a cross-reference view from the owner name,
address, telephone number or email address.

Another powerful service is Wayback Machine, which periodically makes
backups of many websites from the whole Internet. This allows an
investigator to analyze the evolution and changes of a website by
viewing snapshots captured on particular dates.
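
As a hedged illustration, the links used to consult archived copies can
be built as follows. The sketch assumes the snapshot address format
web.archive.org/web/<timestamp>/<site> and the public availability
endpoint of the Internet Archive; the function names are hypothetical
and no request is actually sent.

```python
from urllib.parse import urlencode


def _check_timestamp(timestamp):
    """Accept timestamps written as YYYY up to YYYYMMDDhhmmss (digits only)."""
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must contain between 4 and 14 digits")


def wayback_snapshot_url(site, timestamp):
    """Link to the archived copy of `site` closest to `timestamp`."""
    _check_timestamp(timestamp)
    return f"https://web.archive.org/web/{timestamp}/{site}"


def wayback_availability_query(site, timestamp=None):
    """Link to the availability endpoint, which reports the closest snapshot as JSON."""
    params = {"url": site}
    if timestamp is not None:
        _check_timestamp(timestamp)
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


if __name__ == "__main__":
    print(wayback_snapshot_url("um.es", "20150101"))
    print(wayback_availability_query("um.es", "20150101"))
```

Comparing the snapshots returned for several dates is a simple way to
follow how a website has changed over the years.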

Furthermore, it is possible to visualize domain connections through
Visual Site Mapper or Threat Crowd. DNS and mail servers can also be
checked with Whois, which additionally offers a ping functionality for
checking connectivity and a traceroute functionality to study the data
path to the given domain. There are also services like Alexa and
SimilarWeb which calculate traffic statistics, and others like
FindSubdomains which search for subdomains.


VI. OSINT TOOLS 

A manual use of some techniques would be enough for basic searches.
Unfortunately, using only a few services might not be effective for
challenging investigations. In this sense, the potential of OSINT lies
in using as many services as possible in a concatenated fashion.
Following the workflows repeatedly will extend the available
information until all the pieces of the puzzle fit together. However,
it is not practical for the end user to manually combine several OSINT
techniques and their associated services. Such a tedious task would
entail lengthy research processes.

TABLE 9. Utility of the OSINT services belonging to the domain name technique.

Domain name OSINT service  URL                            Main output
DNS Trails                 securitytrails.com/dns-trails  DNS records and related domains
Whoisology                 whoisology.com                 Personal or company information
Wayback Machine            web.archive.org/web            Backups of websites
Visual Site Mapper         visualsitemapper.com           Map of subdomains
Threat Crowd               threatcrowd.org                Map of subdomains
Whois                      who.is                         Registration info and DNS records
Alexa                      alexa.com                      Traffic statistics
SimilarWeb                 similarweb.com                 Traffic statistics
FindSubdomains             findsubdomains.com             Subdomains

TABLE 10. Main features of the selected OSINT tools.

OSINT tool | Input: identity data | Input: network data | Input: file data | Selectable data source | Output | Extensibility | Interface | Platform | Other features
FOCA | x | Domain | [illegible] | Google, Bing, DuckDuckGo | Identity info, Network info, File info | x | Stand-alone program | Windows | Server discovery
Maltego | Personal information, company, community | Domain | File URL | x | Location, Identity info, Network info, File info | Custom entities and transforms | Stand-alone program | Linux, Windows, MAC | Auto input/output refeed, Results in oriented graph
Metagoofil | x | Domain | File type | x | Network info, File info | x | Command line | Linux, Windows | Option to narrow results
Recon-NG | [illegible] | Domain | x | Several | Location, Identity info, Network info, File info | x | Command line | Linux | Modules for discovery and exploitation
Shodan | Country, City, Keyword | IP Address, Port, Host name | x | x | Operating system, Network info | x | Web interface | Online | Location, Webcam captures
Spiderfoot | Email, Real name, Phone Number | [illegible], Subnet, Host name | x | Several | [illegible], Network info | Custom modules | Web interface | Linux, Windows, MAC | [illegible], Results in oriented graph
The Harvester | Company | Domain, DNS server | x | Several | Identity info, Network info | x | Command line | Linux, Windows, MAC | Results in reports, Option to narrow files and results
IntelTechniques | Personal information, company, community | Domain, IP Address | File name, File type, [illegible] | Several | Location, Identity info, Network info | x | Web interface | Online | Public records, OSINT virtual machine

For this purpose, researchers and developers have implemented more
precise tools that apply OSINT techniques automatically and gather
better quality information from many different sources. These tools
implement several workflows internally and, as a consequence, obtain
more rewarding information and better inferences.

TABLE 10 presents the main features of the most popular and relevant
OSINT tools today. We indicate the type of inputs and outputs they
allow, the capability of including custom functionalities, the type of
user interface, the platform on which they run and other interesting
miscellaneous features.

Beyond these tools, many more OSINT applications exist, and they can be
accessed at OSINT framework.⁴


4 osintframework.com 
A. FOCA

The main contribution of FOCA⁵ (Fingerprinting Organizations with
Collected Archives), designed by ElevenPaths, is the extraction and
analysis of the metadata present in electronic documents. This
application can be used for both local files present in our computer
and external documents that are downloaded from a specified webpage
using three different search engines (Google, Bing, and DuckDuckGo).
FOCA supports a wide variety of formats such as Microsoft Office, PDF,
Open Office, Adobe InDesign, SVG files, etc.

5 https://www.elevenpaths.com/es/labstools/foca-2

This application extracts the hidden information of the files and
processes it to show the user relevant aspects. Some of the details
that are discovered with this procedure are the names of computers
related to the documents, the location where the documents were
created, operating systems used, real names and email addresses of
related users, data about the servers, date of creation of the
documents, range of IP addresses of internal networks, etc. As a
result, a network map can be drawn based on the extracted metadata to
characterize the target.
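
To give an idea of what such metadata looks like, the following Python
sketch reads a few core properties from an Office Open XML document
(.docx, .xlsx or .pptx) using only the standard library. It is not
FOCA's implementation, which is not described here; the function name
and the selection of fields are assumptions made for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of Office Open XML packages.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output label -> element holding the value (an illustrative selection).
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(document):
    """Return selected core properties from a path or file-like Office Open XML document."""
    with zipfile.ZipFile(document) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    metadata = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        if element is not None and element.text:
            metadata[label] = element.text   # keep only the fields that are present
    return metadata
```

Calling read_core_properties on every document published by an
organization and grouping the results by author is one simple way to
start listing its users.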

FOCA additionally includes a server discovery module to complement the
metadata analysis of documents. Some techniques used in this tool are:
(i) Web Search for searching hosts and domain names through URLs
associated with the given domain; (ii) DNS Search for discovering new
hosts and domain names through the NS, MX and SPF servers; (iii) IP
Resolution for obtaining the IP addresses of encountered hosts through
the DNS; (iv) PTR Scanning for finding more servers in a discovered
network segment; (v) Bing IP for extracting new domain names associated
with encountered IP addresses.
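
The PTR scanning step can be illustrated with Python's standard
library. The sketch below is a generic reconstruction of the idea and
not FOCA's code: ptr_names lists the reverse DNS names that would be
queried for each host of a segment, and the hypothetical helper
resolve_ptr performs a single reverse lookup, which requires network
access and is therefore not called in the example.

```python
import ipaddress
import socket


def ptr_names(network):
    """Yield (address, PTR query name) for each usable host of a network segment."""
    for host in ipaddress.ip_network(network).hosts():
        yield str(host), host.reverse_pointer


def resolve_ptr(address):
    """Return the host name registered for `address`, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:   # covers socket.herror and socket.gaierror
        return None


if __name__ == "__main__":
    # 192.0.2.0/30 belongs to a range reserved for documentation.
    for address, name in ptr_names("192.0.2.0/30"):
        print(address, name)
```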

This tool is usually used in the security sector as it supports the
penetration testing of a company. In fact, it produces very good
results because companies do not usually clean the metadata of files
before publishing them online.


B. MALTEGO 

Maltego⁶ is a well-known application that automatically finds public
information about a certain target within different sources (DNS
records, Whois records, search engines, social networks, various online
APIs, file metadata, etc.). The relationships between the found items
of interest are represented in the form of a directed graph for its
analysis. This tool defines four main concepts:


• Entity: a node of the graph representing a discovered piece of
  information. Some default entities are real name, email address,
  username, social network profile, company, organization, website,
  document, affiliation, domain, DNS name, IP address, and so on.
  Furthermore, we could also define custom entities for our specific
  investigation.

• Transform: a piece of code which is applied to an entity to discover
  a new linked entity. For example, the transform “To IP Address”,
  which resolves a DNS name to an IP address, could be applied to a
  domain name entity “um.es” to create a new IP address entity
  “155.54.212.103”. Recursively, we would continue applying more
  transforms, propagating the search process. Apart from the default
  transforms, it is also possible to implement and include custom ones
  for more specific purposes.

• Machine: a set of transforms that are defined together to be executed
  in order, automating and concatenating long search processes.

• Hub Item: a group of transforms and entity types packaged so that
  users of the community can reuse them. By default, Maltego implements
  the hub item called “Paterva CTAS”, which contains the entities,
  transforms and machines maintained by the official developers. In
  addition, it is possible to create and install third party hub items.

6 https://www.paterva.com/web7/buy/maltego-clients.php
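
The interplay between entities, transforms and machines can be modelled
with a short Python sketch. It is a conceptual illustration only and
does not use Maltego's API: all class and function names are
hypothetical, the DNS answer is a canned value taken from the example
above rather than a live lookup, and the netblock transform simply
assumes a /24 segment.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass(frozen=True)
class Entity:
    kind: str     # e.g. "domain", "ip", "netblock"
    value: str


@dataclass(frozen=True)
class Transform:
    name: str
    accepts: str                                  # entity kind the transform applies to
    run: Callable[[Entity], Iterable[Entity]]


@dataclass
class Graph:
    entities: set = field(default_factory=set)
    edges: set = field(default_factory=set)       # (source, transform name, target)


CANNED_DNS = {"um.es": "155.54.212.103"}          # stands in for a real DNS resolution


def to_ip(entity):
    address = CANNED_DNS.get(entity.value)
    return [Entity("ip", address)] if address else []


def to_netblock(entity):
    network = ipaddress.ip_network(f"{entity.value}/24", strict=False)  # assumed /24
    return [Entity("netblock", str(network))]


def run_machine(seed, transforms, max_rounds=5):
    """Apply the transforms round after round until no new entity is discovered."""
    graph = Graph(entities={seed})
    frontier = [seed]
    for _ in range(max_rounds):
        discovered = []
        for entity in frontier:
            for transform in transforms:
                if transform.accepts != entity.kind:
                    continue
                for found in transform.run(entity):
                    graph.edges.add((entity, transform.name, found))
                    if found not in graph.entities:
                        graph.entities.add(found)
                        discovered.append(found)
        if not discovered:
            break
        frontier = discovered
    return graph


if __name__ == "__main__":
    machine = [Transform("To IP Address", "domain", to_ip),
               Transform("To Netblock", "ip", to_netblock)]
    result = run_machine(Entity("domain", "um.es"), machine)
    for source, name, target in sorted(result.edges, key=lambda edge: edge[1]):
        print(f"{source.value} --{name}--> {target.value}")
```

In this model a hub item would simply be a shareable collection of such
Transform objects and entity kinds.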


C. METAGOOFIL 

Metagoofil⁷ works similarly to FOCA. It is a gathering tool which
downloads public files found in a target domain or URL and extracts
their metadata to output knowledge. It generates a useful report for
pentesters with usernames, real names, software versions, and server or
machine names. It can also find further documents that could contain
resource names.

Although it is a command-line tool, it offers some interesting options
for OSINT investigations. Apart from specifying the target domain or
the local folder to analyze, Metagoofil allows filtering file types
(pdf, doc, xls, ppt, odp, ods, docx, xlsx, pptx), limiting the number
of results to search and the number of documents to download,
determining the working directory where downloaded files are saved, or
selecting the file where the output is written.


D. RECON-NG 

Recon-NG⁸ is a web reconnaissance framework similar to Metasploit.⁹ It
presents a command line interface that allows one to select a module to
use, which is essentially an OSINT resource. Then, we set some
parameters if necessary and launch the process. The results of the
searches are continuously saved in a workspace, which in turn feeds the
next rounds of the process.

This tool includes several independent modules that implement
different functionalities. For example, the modules
Bing Domain Web and Google Site Web search the Bing and
Google search engines, respectively, for hosts connected to
the domains of the workspace; PGP Search scans the stored
domains to find email addresses associated with public PGP
keys; Full Contact gathers users and their corresponding social
network profiles from its database, considering stored contacts;
and Profiler searches for additional online services that hold
accounts with the same user names as those in the
workspace.

Recon-NG continuously aggregates all the obtained
information in a local database. In this way, the user directs the
research by selecting the appropriate module and the tool
automates the generation of knowledge from there. The system
scales remarkably well for complex investigations.
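
The workspace idea described above can be sketched as follows. This is an illustrative model only, not Recon-NG's implementation: the table layout, the module functions and the canned lookup tables (`FAKE_HOSTS`, `FAKE_EMAILS`) are all hypothetical stand-ins for real search engines and key servers.

```python
import sqlite3

def open_workspace() -> sqlite3.Connection:
    """Create an in-memory workspace that stores each finding only once."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE findings (kind TEXT, value TEXT, UNIQUE(kind, value))")
    return db

def add(db, kind, value):
    db.execute("INSERT OR IGNORE INTO findings VALUES (?, ?)", (kind, value))

def values(db, kind):
    rows = db.execute(
        "SELECT value FROM findings WHERE kind = ? ORDER BY value", (kind,)
    )
    return [row[0] for row in rows]

# Hypothetical canned data replacing real network queries.
FAKE_HOSTS = {"example.org": ["mail.example.org", "www.example.org"]}
FAKE_EMAILS = {"example.org": ["alice@example.org"]}

def hosts_module(db):
    for domain in values(db, "domain"):
        for host in FAKE_HOSTS.get(domain, []):
            add(db, "host", host)

def emails_module(db):
    for domain in values(db, "domain"):
        for email in FAKE_EMAILS.get(domain, []):
            add(db, "email", email)

def usernames_module(db):
    # Consumes what emails_module stored: earlier results feed later rounds.
    for email in values(db, "email"):
        add(db, "username", email.split("@")[0])

db = open_workspace()
add(db, "domain", "example.org")
for module in (hosts_module, emails_module, usernames_module):
    module(db)
```

Because every module reads from and writes to the same store, the order in which the user launches modules determines how far the investigation propagates.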


E. SHODAN 

Shodan¹⁰ is a search engine that provides public information
about Internet-connected nodes, including IoT devices. This
includes servers, routers, online storage devices, surveillance
cameras, webcams or VoIP systems, amongst others. The
data is collected through protocols such as HTTP or
SSH, and the user can search by IP address, organization,
country name or city.

7 https://github.com/laramies/metagoofil
8 https://bitbucket.org/LaNMaSteR53/recon-ng/wiki/browse
9 https://www.metasploit.com/
10 https://www.shodan.io

This tool is mainly used for network security (to find
devices exposed to the outside or to detect vulnerabilities
in publicly available services), the Internet of Things (to monitor
the growing usage of smart devices and their geographical
location), and ransomware tracking (to measure the
spread of infections caused by this type of attack). It allows
downloading the results in JSON, CSV or XML formats, as well
as generating user-friendly reports.
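
As an illustration of how downloaded results might be post-processed, the sketch below counts exposed services per organization and port. It assumes an export with one JSON object per line and the field names `ip_str`, `port` and `org`; these names are assumptions of this sketch and should be checked against the actual export before use. The sample records are invented.

```python
import json
from collections import Counter

def exposed_services(lines):
    """Count (organization, port) pairs in newline-delimited JSON records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        counts[(record.get("org", "unknown"), record.get("port"))] += 1
    return counts

# Invented sample records using documentation-only IP addresses.
sample = [
    '{"ip_str": "192.0.2.10", "port": 22, "org": "Example Corp"}',
    '{"ip_str": "192.0.2.11", "port": 22, "org": "Example Corp"}',
    '{"ip_str": "192.0.2.12", "port": 80, "org": "Example Corp"}',
]
summary = exposed_services(sample)
```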

In addition to the mentioned functionality, there
are two premium services, namely Shodan Maps
(maps.shodan.io), which permits investigations based on
locations, and Shodan Images (images.shodan.io), which
displays images collected from public devices.


F. SPIDERFOOT 

Spiderfoot¹¹ is another reconnaissance tool that automatically
queries a large number of public data sources to compile
information. The input can be an IP address, subnet, domain name,
e-mail address, host name, real name or phone number. The
results are represented in a graph of nodes with all the entities
and relationships found.

Depending on the type of input provided, this tool
autonomously selects the modules (equivalent to Maltego
transforms) to activate for a more effective reconnaissance.
Moreover, it also considers the level of search selected by
the user. Spiderfoot offers four types of scans: (i) Passive
collects as much information as possible without touching the
target site, thus avoiding detection by the target; (ii) Investigate
conducts a basic scan in order to assess the target’s
maliciousness; (iii) Footprint identifies the network topology
of the target and gathers information from the web and search
engines, which is sufficient for standard investigations; and (iv) All,
which is advisable for detailed investigations despite taking
a long time to complete, consults all possible
resources related to the target.
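
The input-dependent module selection can be sketched as a small classifier. This is not Spiderfoot's actual logic: the regular expressions are deliberately simple, the module names in `MODULES` are hypothetical, and domain names and host names are merged into one category because they cannot be told apart by syntax alone.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?\d[\d\s-]{6,}$")
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)

# Hypothetical module names, for illustration only.
MODULES = {
    "ip_address": ["reverse_dns", "geolocation"],
    "subnet": ["host_sweep"],
    "email": ["breach_lookup", "account_finder"],
    "phone": ["phone_lookup"],
    "domain_or_host": ["dns_records", "whois"],
    "real_name": ["social_profiles"],
}

def classify_seed(seed: str) -> str:
    """Guess the type of an input string from its syntax."""
    seed = seed.strip()
    try:
        ipaddress.ip_address(seed)
        return "ip_address"
    except ValueError:
        pass
    if "/" in seed:
        try:
            ipaddress.ip_network(seed, strict=False)
            return "subnet"
        except ValueError:
            pass
    if EMAIL_RE.match(seed):
        return "email"
    if PHONE_RE.match(seed):
        return "phone"
    if HOST_RE.match(seed):
        return "domain_or_host"
    return "real_name"  # fallback when nothing else matches

def select_modules(seed: str) -> list:
    return MODULES[classify_seed(seed)]
```

A scan level such as Passive could then be modelled as a further filter that removes any module contacting the target directly.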

This tool can be used to launch penetration tests that reveal
data leaks and vulnerabilities, to run red team challenges, or to
support threat intelligence. In addition, it is worth noting that
it is possible to program custom Spiderfoot modules.


G. THE HARVESTER 
The Harvester¹² allows the collection of public information
related to a domain or company name through search engines.
In particular, it is capable of listing emails and host names of
the company, as well as subdomains, IP addresses and URLs
related to the domain. It can also output the results in
user-friendly HTML or XML representations. This resource is used in
the early stages of a penetration test.
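
The core extraction step, listing e-mail addresses and host names that belong to a domain, can be approximated with regular expressions applied to text that has already been retrieved. The sketch below is not The Harvester's code; the function name `harvest` and the sample text are hypothetical, and the patterns are intentionally simplified.

```python
import re

def harvest(text: str, domain: str):
    """Return sorted e-mail addresses and host names under `domain` found in `text`."""
    escaped = re.escape(domain)
    email_re = re.compile(rf"[\w.+-]+@(?:[\w-]+\.)*{escaped}\b", re.IGNORECASE)
    # At least one label is required in front of the domain (a subdomain or host).
    host_re = re.compile(rf"\b(?:[\w-]+\.)+{escaped}\b", re.IGNORECASE)
    emails = sorted({match.lower() for match in email_re.findall(text)})
    hosts = sorted({match.lower() for match in host_re.findall(text)})
    return emails, hosts

sample = (
    "Contact alice@example.org or Bob.Smith@mail.example.org. "
    "Servers: www.example.org, vpn.example.org; unrelated: www.example.com"
)
emails, hosts = harvest(sample, "example.org")
```

In a real workflow the input text would come from search engine result pages, which is where most of the practical difficulty lies.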

This tool is managed from the console and implements two
options when scanning the target website. On the one hand,
The Harvester is the original script, which provides
the list of related email addresses; on the
other hand, EmailHarvester improves the procedure by digging
deeper for better results.

11 https://www.spiderfoot.net
12 https://github.com/laramies/theharvester


H. INTELTECHNIQUES 

IntelTechniques¹³ is a tool, created by Michael Bazzell, which
offers hundreds of online search utilities grouped by technique.

When using it, the investigator selects the services to be
used and the tool automatically creates the associated query
links. Afterwards, the user can open them in the browser to
launch the queries. However, the visualization and collection
of the information are still manual.
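
The link-generation behaviour described above can be sketched in a few lines. The catalogue `SERVICES` is a hypothetical example built from three general-purpose search engines; it does not reproduce the utilities actually offered by IntelTechniques.

```python
from urllib.parse import quote_plus

# Hypothetical catalogue: technique -> {service name: URL template}.
SERVICES = {
    "web": {
        "Google": "https://www.google.com/search?q={query}",
        "Bing": "https://www.bing.com/search?q={query}",
        "DuckDuckGo": "https://duckduckgo.com/?q={query}",
    },
}

def build_links(technique: str, term: str, selected=None) -> dict:
    """Create one query link per selected service for the given search term."""
    templates = SERVICES[technique]
    names = selected or list(templates)
    return {
        name: templates[name].format(query=quote_plus(term)) for name in names
    }

links = build_links("web", "john doe", ["Bing"])
```

The function only produces URLs; opening them and collecting the results remains a manual step, as in the tool itself.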

Although it does not implement an automatic
integration of services, we have considered IntelTechniques
an OSINT tool because it facilitates launching searches across a
wide range of services from a centralized platform.

Unfortunately, this tool ceased to be free and blocked its 
open access as of July 2019 due to constant cyberattacks. 


I. OSINT TOOLS COMPARISON 
Depending on the user needs (see TABLE 10), some tools 
will be more suitable than others for a given task. 

Thus, if we intend to extract hidden information from 
files, FOCA and Metagoofil are specific tools designed for 
this purpose. In particular, the first product seems to be more 
complete, mature and powerful than the second one. FOCA 
presents additional functionalities, apart from the metadata 
analysis of files, to complement the hidden information. 
As a result, it is able to infer more knowledge about the 
target. 

If we are looking for network information, Shodan,
Spiderfoot and The Harvester are the recommended options.
On the one hand, we would suggest
Spiderfoot to analyze the topology of the target and retrieve
internal (but public) information about the target organization.
On the other hand, we would complete the results with
Shodan to include specific information about IoT devices,
surveillance cameras, webcams, VoIP systems, or smart
services in general.

Last but not least, if the aim of the search is to
gather as much information as possible for a given input,
Recon-NG and Maltego are the most complete
resources and will return diverse data and relationships.
The first one contains many modules and interacts with a
local database that grows during the investigation, making it an
ideal framework for penetration testing, for preventing phishing
and social engineering attacks, or even for profiling a
person. In contrast, if we want to avoid the command line
and opt for a more user-friendly interface, Maltego is a good
alternative for OSINT activities. It implements automated
inference processes with transforms that broaden the scope of
the original search. Moreover, it is extensible with custom
discovery procedures.


13 https://inteltechniques.com

Although the comparison above has
been made according to the desired output, in practice the
user will be restricted by the available input and the data
types accepted by the chosen OSINT tools. Finally, note that
these tools are complementary and not mutually exclusive,
meaning that a deep and thorough OSINT investigation could
profit from several of them at the same time. Although some
of them may produce similar results for a given search, there
can always be details found by one particular tool that are not
obtained by the others.


VII. INTEGRATION OF OSINT IN CYBERATTACK 
INVESTIGATIONS 

The implementation of mechanisms for detection of and 
response to cyberincidents is an obligation today. Compa- 
nies and organizations, which are increasingly exposed on 
the Internet, invest in cybersecurity to protect their assets 
against criminals. Therefore, it is remarkably important to 
manage threats and incidents against information systems 
effectively. 

Cyberdefence is not only the deployment of technical
solutions such as firewalls, IDSs (Intrusion Detection Systems),
IPSs (Intrusion Prevention Systems), SIEMs (Security
Information and Event Management) or anti-viruses to avoid
known threats, but also the implementation of cyberintelligence
to extract and analyze traces, patterns and conclusions from
the incidents. In fact, the continuous cycle of extracting and
sharing evidence, relationships, and consequences of incidents
is known as threat intelligence [65]. It complements the
traditional defence mechanisms with up-to-date information
and greatly improves the protection of the infrastructures,
the management of the hazards and the effectiveness of the
responses [41].

Moreover, the information that is typically used for
forensics and investigations is merely technical. However,
the traces left by a cyberattack contain valuable information
that should be contrasted not only with repositories of
incidents [66], but also with social networks, forums, media,
technical and governmental documents and other public digital
sources. These open sources contribute semantic
information to the analysis, which proves useful
for computing and reasoning about more complex and far-reaching
inferences. Note that cyberattackers use the Internet for their
illegal actions (hacking, phishing, denial of service attacks,
botnets, identity theft, intrusions, etc.), but also for personal
reasons. In this sense, OSINT can be used to connect all those
points.

Several works applying OSINT to cybersecurity focus
on proposing defensive improvements when facing threats.
In contrast, they very seldom seek the identification of
cyberattackers. OSINT is a source of knowledge that could
support the investigation of a cyberattack by going from the
smallest details of the malicious action to the root of the
problem. This last challenge is not new, since it is traditionally
known as the attribution problem [67]. Concretely,
OSINT would allow us to understand the motivation of the
cyberattack, to infer the procedure and ultimately to profile
the perpetrator.

The suggested application of OSINT is illustrated in 
FIGURE 3. Note that several methodologies and models 
have been proposed to define the detection maturity of an 
organization, which is crucial to extract evidence from a
suffered cyberattack. Nonetheless, there is a lack of standards
to represent taxonomies and ontologies in this field [68];
thus, we propose a modified version of Ryan Stillions’ DML
model [69] to exemplify this section. However, another
cyberthreat detection scheme could be used to show the appli- 
cation of OSINT in a similar way. 

The DML model hierarchically represents different
levels of abstraction in the detection of cyberattacks. A company
that does not invest in cybersecurity will only be able to
reach the lowest steps in the stack. In contrast, an organization
that is technically skilled in cyberdefence may interpret
more complex facts, that is, ascend to levels with more
abstraction.

While the lower levels can be easily covered, the challenge 
lies in reaching the higher layers. To this end, we suggest 
applying OSINT as a source of intelligence that feeds on the 
most basic evidence to arrive at more robust facts: 

1) Firstly, we assume that it is possible to cover levels 
DML-1 and DML-2. The first one, Atomic indicators 
of compromise (IOC), is composed of details as simple
as a string in a modified file, the value of a memory 
cell or a byte transmitted through the network, which 
have very low value on their own, but together form 
the next level. The Host and Network Artifacts layer is 
built upon the indicators observed during or after the 
cyberattack such as IP addresses, domain names, logs, 
transactions, hash values, or file manipulation details. 
As this type of data resides in the affected informa- 
tion systems, in our framework it is considered as an 
input for the collection of associated information in 
open sources (see SECTION V for more details about 
OSINT collection). Therefore, the extraction of these 
traces is the starting point of an OSINT process. 

2) Next come levels DML-3 to DML-6. The
third level, Tools, consists in detecting the transfer, presence
and functionality of the tools used by the attacker.
The following level, Procedures, is covered if one is able
to enumerate the steps performed during the incident.
The fifth level, Techniques, captures how the attacker has
specifically performed the various phases of the attack.
The last level here, Tactics, is a more abstract concept
that takes into account the levels discussed above
and derives knowledge by analyzing a set of activities
in time and context.

In this case, the information reveals details about the
execution of the cyberattack. Such data greatly enriches
the analysis phase of the OSINT cycle. The patterns
derived from this data, as well as the correlation with
other cases already stored, allow for a more
intelligent and comprehensive analysis. In fact, these
conclusions should be integrated with
the results obtained in the collection phase. In this
way, the exploration through the network is refined,
narrowing the investigation towards the final objective.

FIGURE 3. OSINT integration with DML model to address the attribution problem.

3) Finally, the continuous gathering and analysis process
of OSINT generates valuable information to
which knowledge-extraction techniques are applied.
The knowledge extracted with OSINT from levels
DML-1 to DML-6 would allow us to reach the highest
levels, that is, DML-7 to DML-9. The seventh
level, Strategy, refers to a high-level description of the
attack planned by the cybercriminal to achieve his/her
purposes. The eighth level, Goals, comprises the specific
objectives of the attacker and expresses the real motivation
of the action. At the top we find the Identity level,
which is essentially the name of the person, organisation
or even country responsible for the
malicious actions. As it is extremely difficult to find
such detailed information, the connection with other
cyberattacks and the similarity with other events can
support relative attribution [67]. That is, completing
the investigation of the current case with additional
information about other incidents apparently caused by
the same actor brings us closer to the absolute identification
of the cyberattacker.
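
The three steps above can be summarised in a small data structure that maps each DML level to the role it plays in the proposed OSINT integration. The level names follow the list above; the role labels returned by `osint_role` are shorthand introduced for this sketch and are not terminology from the DML model itself.

```python
from enum import IntEnum

class DML(IntEnum):
    """Detection maturity levels as enumerated in the list above."""
    ATOMIC_IOC = 1
    HOST_NETWORK_ARTIFACTS = 2
    TOOLS = 3
    PROCEDURES = 4
    TECHNIQUES = 5
    TACTICS = 6
    STRATEGY = 7
    GOALS = 8
    IDENTITY = 9

def osint_role(level: DML) -> str:
    """Map a level to the part of the OSINT cycle it relates to."""
    if level <= DML.HOST_NETWORK_ARTIFACTS:
        return "collection input"            # step 1: traces seed the search
    if level <= DML.TACTICS:
        return "analysis enrichment"         # step 2: execution details refine it
    return "knowledge extraction output"     # step 3: what OSINT aims to infer

roles = {level.name: osint_role(level) for level in DML}
```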

This application of OSINT represents an innovative line of
action to fight against cyberthreats. The challenge resides in
implementing effective collection mechanisms and intelligent
analysis procedures to extract those high-level details
that cannot be directly extracted from malicious actions.
Such details are the most difficult pieces of information
to obtain, as they have a very high degree of abstraction and
are far removed from the technical details. That is why it is
smart to look to open sources for any relationship or pattern
that leads us to discover more about the context and originators
of an incident. OSINT is the piece that was missing from
the machinery to profile cyberattackers and to improve the detection
of sophisticated attacks [70], thanks to the consideration of
high-level behaviour aspects from DML-3 to DML-9.


VIII. OSINT IN COUNTRIES AND STATES 

OSINT is not only beneficial in the private sector, but
also represents a resource of public interest for governments.
In this regard, in SUBSECTION VIII-A we discuss
that OSINT is not a paradigm designed for paranoid
analysts or computer geeks, but indeed brings an enormous
benefit to the national cyberdefence system [71]. Likewise,
in SUBSECTION VIII-B we observe that official authorities
not only profit from OSINT results for internal tasks,
but also indirectly make the application of OSINT easier for third
parties. In fact, they become an agent that generates large
amounts of data accessible to everyone. In this sense, governments
are a double-edged sword: they benefit from OSINT
but at the same time contribute to feeding the Internet with
really valuable, and sometimes even sensitive, information.


A. INTERNAL STATE AFFAIRS OPERATIONS 

Intelligence Agencies have been traditionally associated with
the work of Law Enforcement Agencies (LEAs) and Military
Bodies. In the same way, OSINT is nowadays considered
a key element of classified investigations and secret
operations in state affairs [5]. To some extent, one could
safely argue that the exploitation of OSINT can provide
critical capabilities for LEAs to complement and enhance
their counterintelligence departments in the investigation and
strategic planning to fight against crime [72].

As far as we were able to explore in the official
websites, reports and documentation, government organizations
seem to implement internal mechanisms which
basically consist in gathering raw information and transforming
it into useful knowledge, leveraging OSINT
mechanisms [73]. Representative examples are
the U.S. Federal Bureau of Investigation (FBI, fbi.gov),
U.S. Central Intelligence Agency (CIA, cia.gov),
Canadian Security Intelligence Service (CSIS,
canada.ca/en/security-intelligence-service), European
Union Agency for Law Enforcement Cooperation
(EUROPOL, europol.europa.eu), North Atlantic
Treaty Organization (NATO, nato.int), United States
Department of the Army (DA, army.mil), U.S. Department
of Defense (DoD, defense.gov), U.S. National Security
Agency (NSA, nsa.gov) or European Defence Agency
(EDA, eda.europa.eu), amongst others.

In this scenario of uncertainty, we have decided to
investigate in particular the case of Spanish LEAs, for affinity,
to demonstrate that official bodies do indeed apply
OSINT internally. As a result of this thorough inspection, we can
emphatically confirm that it is not easy to find clear evidence
of the application of OSINT by the state forces. The confidentiality
of this type of agency makes it difficult to discover
their internal operating mode and the impact of OSINT on
their current investigations. Nevertheless, as a consequence
of the deep search, we have some subtle findings that confirm
that OSINT is currently used by Spanish LEAs:


• Back in 2007, the director of the CNI (i.e., Spanish
National Intelligence Agency) said¹⁴ that open sources
were “fundamental to the elaboration and work of
Intelligence”.

• CIFAS (i.e., Spanish Military Intelligence Agency) also
seems to use OSINT as a way of obtaining information.
We have found some slides that confirm this, dated as
early as 2008, which are uploaded to the Spanish
Defense Staff website.¹⁵

• In 2010, when the director of the CNI announced¹⁶ the
creation of an ethical code for special agents, he also
insisted on the fact that modern intelligence was not
just based on physical presence, as today “you might
get more information sitting on a computer, exploring
messages from the bad guys”.

• More recently, in 2017, the Spanish Ministry of Defense
opened a public call¹⁷ for the contract called “Development
of OSINT tool based on IDOL HAVEN platform”.

• At present, the Spanish Army is designing a
new model called Brigade 2035 which incorporates
innovative technological advances for enhancing operations.
In this project,¹⁸ one of the defined combat functions
is Intelligence, which clearly states OSINT as a
key responsibility: “Other facilities of growing importance
will be open source obtainment (including social
networking)”.

• The Spanish Ministry of the Interior has published in the
Annual Recruitment Plan for 2019¹⁹ some investments
in “systems for obtaining OSINT in the cyberspace”.

14 https://www.elconfidencialdigital.com/articulo/vivir/CNI-califica-fundamental-abiertas-contradice/20071023000000049386.html
15 http://www.emad.mde.es/Galerias/EMAD/novemad/fichero/EMD-CIFAS-esp.pdf
16 https://www.lavanguardia.com/politica/20100624/53951898847/el-director-del-cni-anuncia-un-codigo-etico-para-los-agentes-secretos.html
17 https://contrataciondelestado.es/wps/wcm/connect/ff96fa82-7fd6-40bd-be5b-36ef3fd4e65b/DOC_CN2017-498874.pdf?MOD=AJPERES


Bearing in mind all these facts, it seems that
OSINT is indeed currently relevant in the internal affairs of Spain.
Analogously, we could also highlight that European Union
member states are also highly developed in OSINT [74].


B. OPEN DATA POLICIES AND TRANSPARENCY 

OSINT depends on the public data available on the Internet, 
among other sources, to be effective. In this regard, apart from 
social networks and other open data sources, there are also 
authoritative and official sites maintained by state institutions 
around the world where public information is published and, 
therefore, openly available. 

The Open Data Barometer (ODB)²⁰ is a global ranking
system designed by the World Wide Web Foundation that
measures the readiness, implementation and impact of countries’
open data policies. FIGURE 4 shows the scores of the
latest full edition.²¹

As in the previous subsection,
we study the specific case of Spain for affinity. In the
aforementioned ODB report, Spain is ranked in
the 11th position. Besides, according to the European Data
Portal and its official reports²² about Open Data maturity
across Europe, Spain is one of the most advanced countries
in transparency and open data. It has been in first or second
position in the ranking of Open Data Maturity in the
last four years. As stated there, the Spanish Government has
promoted more than 160 open data initiatives and has over
23,800 public information catalogues. For example, the Open
Data Initiative of the Government of Spain²³ is clear proof
of how Spain encourages transparency. OSINT could benefit
from that, but it would have to deal with aggregated and statistical
information by linking it and inferring new knowledge.

18 www.ejercito.mde.es/en/estructura/briex_2035/principal.html
19 http://www.defensa.gob.es/Galerias/gabinete/ficheros_docs/2019/PACDEF_2019_Documento_Pxblico.pdf
20 https://opendatabarometer.org
21 https://opendatabarometer.org/4thedition
22 https://www.europeandataportal.eu/en/dashboard#2018
23 https://datos.gob.es/es

FIGURE 4. Transparency scores by the 4th edition of Open Data Barometer.

There are also anonymized databases that, a priori, would
not be useful for OSINT because they lack the value to
produce intelligence. However, these so-called anonymous datasets
apparently do not break the link between the data and its owner.
Recently, an algorithm [75] has been published allowing
99.98% of Americans to be unequivocally identified from
public data. In particular, it is enough to have 15 parameters
related to medical, behavioral and socio-demographic
information such as marital status, sex or the zip code of their
home. Therefore, OSINT could again be used to re-identify
people included in anonymized databases.
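
The intuition behind this re-identification risk can be illustrated by counting how many records become unique as more attributes are combined. The sketch below is a plain uniqueness count over invented toy records; it is not the algorithm published in [75], and the attribute names are hypothetical.

```python
from collections import Counter

def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` appears exactly once."""
    keys = [tuple(record[name] for name in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)

# Invented toy records with no names or direct identifiers.
records = [
    {"zip": "30001", "sex": "F", "marital": "single"},
    {"zip": "30001", "sex": "F", "marital": "married"},
    {"zip": "30001", "sex": "M", "marital": "single"},
    {"zip": "30002", "sex": "M", "marital": "single"},
]

by_sex = unique_fraction(records, ["sex"])
by_zip_sex = unique_fraction(records, ["zip", "sex"])
by_all = unique_fraction(records, ["zip", "sex", "marital"])
```

In this toy dataset no record is unique on sex alone, half are unique on zip code plus sex, and all of them are unique once marital status is added, which is why a handful of innocuous attributes can act as an identifier.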

In contrast, there are also governmental platforms
which are actually not anonymized. For instance,
the Spanish Ministry of the Treasury, the Spanish
Ministry of the Interior or the Spanish Ministry of
Defense usually publish documents with personal information
(“site:hacienda.gob.es filetype:pdf
intext:dni”, for example). In the same way, this could
also be applied to the websites of Spanish Autonomous Communities.
Moreover, Europe has a public data platform²⁴ too, where
we could find a lot of public information. For instance, in the
context of foreign policy and security, an updated list of
financial sanctions is presented in the “European Union
Consolidated Financial Sanctions List” document. In particular,
it reveals personal information about individuals, groups and
entities.

All the aforementioned facts demonstrate that governments
worldwide are adopting strong Open Data policies. As a
direct consequence, the amount of objective data available
on the Internet is rapidly increasing. OSINT should, in addition
to other open sources of information, take advantage
of this powerful opportunity to collect, analyze, link and
infer knowledge from reliable and official sources. In this
scenario, and according to the ODB, countries such as the United
Kingdom, Canada, France, the United States, Korea, Australia,
New Zealand, Japan, the Netherlands, Norway, or Brazil are real
OSINT goldmines with characteristics very similar to those
described for Spain.

24 http://data.europa.eu/euodp/en/data


IX. OPEN CHALLENGES AND FUTURE TRENDS 

The review carried out on OSINT shows that there is already a
substantial amount of work on the topic. Numerous techniques
and tools have been developed up to now. However, some
gaps and limitations in this field must be overcome to continue
exploiting the offered opportunities. It is necessary to make more
sophisticated solutions applicable to uncontrolled scenarios
of the real world. We have spotted some challenges that, as far
as we know, remain open and should be faced by the
research community in the near future.


A. AUTOMATION OF THE GATHERING PROCESS 

The greater the amount of information collected, the more
likely it is that inferences and relationships can be created. However,
the quantity of public data available today is enormous and
cannot be collected manually [76]. Although OSINT
techniques (Section V) and tools (Section VI) are already a
big step forward in this direction, most of them are still largely
dependent on the end user. In this sense, it would be appealing
to incorporate more sophisticated techniques. We highlight
current big data techniques such as Web crawling or Web
scraping [77] as potential paradigms to automate and improve
the OSINT exploration of high volumes of open data.

An important aspect of the collection process is
the propagation of the search. The results obtained from
searches should feed the following rounds of gathering.
In OSINT it is really powerful to extract pivots that permit
the concatenation of outputs as new inputs for propagation.
This recursive method increases the scope of research and is
closely related to the analysis process that we will discuss
next.
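
The propagation of a search through pivots can be sketched as a breadth-first expansion in which each new result becomes a future query. The `lookup` callable and the canned `GRAPH` below are hypothetical placeholders for real OSINT queries; the `max_rounds` bound is an assumption added here to keep the recursion finite.

```python
from collections import deque

def propagate(seed, lookup, max_rounds=3):
    """Expand a seed breadth-first, reusing every new result as a new input."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        item, depth = frontier.popleft()
        if depth == max_rounds:
            continue  # do not pivot any further from this item
        for result in lookup(item):
            if result not in seen:
                seen.add(result)
                frontier.append((result, depth + 1))
    return seen

# Hypothetical canned source standing in for real queries.
GRAPH = {
    "example.org": ["admin@example.org", "www.example.org"],
    "admin@example.org": ["admin_handle"],
    "admin_handle": ["forum-profile/admin_handle"],
}

def canned_lookup(item):
    return GRAPH.get(item, [])

found = propagate("example.org", canned_lookup, max_rounds=2)
```

The `seen` set prevents the same pivot from being queried twice, which matters in practice because independent sources frequently return overlapping results.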


B. ENHANCEMENT OF THE ANALYSIS AND KNOWLEDGE 
EXTRACTION PROCESSES 

The interpretation of the collected open data is a key point in
the OSINT procedure. Extracting the essence of the scraping
results, establishing relationships between separate pieces of
information, or inferring conclusions that are not explicitly
stated increases the quality of the results. Indeed, the recursive
propagation into further rounds of
investigation is enhanced by means of better inputs.

However, as far as we know, OSINT analysis does not implement
intelligent mechanisms today. The existing tools are
limited to dumping all the information found and its explicit
relationships. Instead, the analysis process should
incorporate semantic analysis, the study of patterns, and correlation
with other events, occurrences or datasets.

Fortunately, modern data mining techniques [78] such
as Natural Language Processing, Social Network Analysis,
Machine Learning or Deep Learning are designed
to solve this type of challenge. A proper selection of algorithms
in this field of knowledge will make the difference
between the current static analysis and the future reasoned
processing [79].

Ideally, the OSINT of the future should be able to provide
the end user with the specific piece of information he/she is
searching for, as well as to return convincing answers in
investigations. The original search would also yield not only direct
inferences, but also indirect and non-explicit relationships.

This challenge builds the path between the Second
Generation and the Third Generation of OSINT. As
presented in [1], the Second Generation started with the rise of
the Internet and Social Media, and the challenges were “technical
expertise, virtual accessibility and constant acquisition”.
In contrast, the evolution to the Third Generation is expected
to be taking place now and will have to include “direct and
indirect machine processing of data, machine learning, and
automated reasoning”.


C. INTEGRATION OF SEVERAL OPEN DATA SOURCES 
OSINT activities should consult as many sources as possible
in order to cover the widest possible spectrum. It is not a good
idea to focus our research on a single social network or a
specific forum. In this sense, success lies in combining data
sources to obtain the best possible results. This means that
the system has to normalize the available information, which
is typically unstructured, in order to perform an effective
analysis and correlation. It is also important to discard
repeated items. In fact, the different OSINT techniques and
tools explained in this paper already apply such sifting
to gather the knowledge related to the target.

On the other hand, the real challenge is to incorporate
not only several data sources, but also different types of data
sources [80]. Apart from data extracted from the Internet,
Dark Web and Deep Web, the OSINT workflow should also
consider information collected face to face, through social
engineering, or with citizen collaboration. Any piece of information
that is relevant to our investigation has to be used
in order to achieve the next milestone of the search. Additionally,
the implementation of truth discovery
processes is a must for those cases in which information from different
data sources is contradictory [81].
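
One very simple form of truth discovery is reliability-weighted voting, sketched below. The source names and reliability scores are invented for illustration, and fixed weights are a simplifying assumption: more elaborate truth discovery methods typically estimate source reliability iteratively instead of taking it as given.

```python
from collections import defaultdict

def resolve(claims, reliability, default_weight=0.5):
    """For each attribute, keep the value backed by the largest total reliability."""
    scores = defaultdict(lambda: defaultdict(float))
    for source, attribute, value in claims:
        scores[attribute][value] += reliability.get(source, default_weight)
    return {
        attribute: max(candidates, key=candidates.get)
        for attribute, candidates in scores.items()
    }

# Invented claims as (source, attribute, value) triples.
claims = [
    ("official_registry", "city", "Murcia"),
    ("forum_post", "city", "Madrid"),
    ("social_profile", "city", "Madrid"),
    ("official_registry", "employer", "ACME"),
]
# Hypothetical reliability scores in the range 0 to 1.
reliability = {"official_registry": 0.9, "forum_post": 0.3, "social_profile": 0.4}

resolved = resolve(claims, reliability)
```

Here two low-reliability sources agreeing on one value are still outweighed by a single official source, which is the behaviour a plain majority vote would not provide.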


D. FILTERING OUT IRRELEVANT DATA
AND MISINFORMATION

Due to the huge amount of data publicly available, an OSINT 
process needs to be capable of distinguishing the relevance 
of each piece of information, discarding data which do not 
add quality to the investigation [82]. A researcher cannot 
focus on exploring the details of an entire website, reading 
a multi-page news item or analyzing a complex government 
document. On the contrary, OSINT research needs to extract 
keywords which actually provide value and reveal knowledge 
about our target. The piece of information we are interested 
in may not be explicitly posted, and the challenge would be 
to extract the essence of the data source we are scrutinizing. 
At the same time, the precise terms extracted serve as pivots 
to create new paths of exploration. 
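
A common baseline for extracting such keywords is TF-IDF scoring, which favours terms that are frequent in one document but rare in the rest of the collection. The sketch below is a minimal standard-library version with a naive tokenizer; the sample documents are invented, and a real system would add stop-word handling and language-specific processing.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())

def top_keywords(documents, index, k=3):
    """Rank the terms of documents[index] by TF-IDF against the whole collection."""
    tokenized = [tokenize(document) for document in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    target = tokenized[index]
    term_freq = Counter(target)
    total = len(documents)
    scores = {
        term: (count / len(target)) * math.log(total / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

docs = [
    "the server leaked the password database",
    "the forum discussed the weather",
    "the server was rebooted",
]
keywords = top_keywords(docs, 0, k=3)
```

Terms that occur in every document receive a score of zero, so ubiquitous words drop out of the ranking, while the surviving terms can serve as the pivots mentioned above.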

Furthermore, it is crucial to detect misinformation that
would corrupt the results [83]. By nature, the Internet is
subjective and the majority of the content has no guarantee
of being reliable and official. The OSINT community has
to determine whether the increasing reliance on open source
data is still accompanied by source validation, which
represents a primary requirement and priority [84]. Untrue
information can divert our search, leading to erroneous results
or away from our real objective. For that reason, it would be
interesting to analyze not only the objective information,
but also the false information with the aim of extracting
intelligence.

This problem will be present in real-life research. The data
sources where we will find the most valuable information about
suspects are forums and social networks. On these sites,
the investigator has to deal with opinions, subjective
publications, and personal preferences whose veracity is
questionable [85]. Profiling persons who in reality do not
represent a threat (false positives) could provoke
discriminatory and unfair attitudes that could affect the
victims.


E. EXTENSION ACROSS THE WHOLE WORLD 

One of the main drawbacks of many of the existing OSINT
resources is that they only function for specific countries,
reducing their profiling capability to a constrained group of
people belonging to a few nationalities. However, OSINT
should be a universal technique able to reach all corners
of the Earth instantly without discriminating between zones
of cyberspace. Thus, interoperability is a desirable property
to be considered in OSINT design, as it would increase not only
the scope of the searches but also the adoption of OSINT by
end users.



Ideally, a good OSINT service or tool should not distinguish
between countries and should treat each investigation as a
global task, without borders. The OSINT workflow should
combine points of information across the world and correlate
those distributed data sources. In fact, although the
relationships between search zones could be established by
hand, the real challenge lies in OSINT applications
performing these jumps automatically.

In addition, a globalized process would not leave aside
appealing open data sources from different territories,
which could actually fill the gaps we need to address
in our investigation. In Spain, for instance, we use tools
that are designed in (and for) foreign countries. However,
there are no OSINT solutions that include Spanish public
repositories (such as government open data platforms) in the
gathering phase. In this sense, we are not yet fully
benefiting from the goldmine of being one of the most
transparent countries in Europe.

A generic and flexible implementation is especially useful
for nomadic targets for whom mobility is part of daily
life. Consider a person who has lived in several countries
at different stages of his life, a company with headquarters
on several continents, or criminals who change their location
to make it more difficult to pursue them. In these cases,
a static search in a particular country would leave a lot of
information uncollected and a lot of clues unanalyzed.


F. AWARENESS OF PRIVACY, ETHICAL AND LEGAL CONSIDERATIONS

From an ethical point of view, OSINT must respect the user’s 
privacy so as not to harm his private life, as well as the 
privacy of his family, friends and co-workers. The fact that 
the information is publicly accessible does not mean that 
it is not sensitive. Knowing the personal preferences and
tastes of the target can intrude on his privacy. Revealing
political thoughts can have fatal consequences in certain 
places. Communicating a sexual orientation can be poten- 
tially life threatening in certain countries. Knowing religious 
beliefs can lead to criminal convictions in specific terri- 
tories. Thus, the open source information has to be han- 
dled carefully, for legitimate purposes, in the interests of 
society. 

From the legal point of view, OSINT should be used on a
legal basis and in compliance with data protection policies.
With the advent of the EU GDPR, the regulation of personal
data has changed [86]. In this sense, personal data
comprise any information that can be related to a citizen.
Moreover, different pieces of information which, collected
together, can lead to the identification of an individual also
constitute personal data, even if the information is encrypted
or anonymized [14]. A possible solution to address this
challenge is to adapt the design of OSINT tools to embed
normative constraints, especially GDPR legal requirements [87].
By definition, OSINT is completely legal due to the public
nature of the data sources it uses. Nevertheless, investigators
must not publish the gathered personal information, even if it
is posted on the web. In addition, the user who applies OSINT
must not make the mistake of trying to impersonate the target
in order to find more information. It should also be noted that
authentication barriers must not be broken in order to access
the information we are looking for.

In short, the use of OSINT should be restricted to legal
activities and non-malicious purposes. In principle, OSINT
does not (and should not) violate human freedom and rights;
therefore, its previously mentioned techniques and services
are legal to this extent [88]. It is a really powerful
methodology, but it is also dangerous if misused. Thanks to
OSINT, journalists can provide up-to-date, objective and
quality news. Human resources managers can get to know
job applicants better. National authorities can investigate
criminal and terrorist groups. A company can audit its
external exposure to cyberthreats. However, opening the
use of OSINT techniques to specific categories of users
should always be properly justified [89].

On the downside, the OSINT end-user could be a criminal
trying to commit an offence. A cracker could profile the
target to increase the likelihood of success. A thief could
analyze family members to burgle their home at the best time.
An extortionist could publish the private and personal
information of the victim if a ransom is not paid.

Developers have to consider the aforementioned aspects
when implementing OSINT tools. In any case, for our sake,
the most powerful tools should only be available to LEAs and
Intelligence Agencies.


G. BATTLE AGAINST OSINT MISUSE 

As already mentioned throughout the previous sections,
the potential of the OSINT paradigm is quite broad.
It is indeed possible to take advantage of open
data for cybersecurity and cyberdefence purposes, for example
to investigate attackers and/or terrorist groups [90].
Nevertheless, the exploitation of publicly available data is
prone to abuse. That is, ill-motivated actors may leverage
the huge amount of information in order to commit
cyber-aggressions, such as cyberbullying, cybergossip and
cyber-victimization [91]. Unfortunately, those phenomena
are becoming alarmingly frequent on the Web,
leading the victims to distress, loneliness, depression, and
even suicide in the worst case [16]. In particular,
cybergossip is performed by a group of people making
evaluative comments via digital devices about somebody who is
not present. This cyberbehavior affects the social group in
which it occurs and can hinder peer relationships, damaging
the victim of the process [92].

To this extent, it is important to ensure that OSINT
techniques and services are used in the correct manner,
without harming others’ rights and freedom [93]. More
specifically, one could grant different privileges based on
the end-user category, thus avoiding granting full access to
the entire spectrum of information. For example, employees may
have access to basic information in order to support their
tasks (e.g., for HR recruitment duties), while government and
police forces may explore and investigate more open data
(e.g., to hunt a cyber criminal).
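
The idea of privileges per end-user category can be sketched as a tiered filter applied to the record an OSINT tool returns. The role names, the field names and the assignment of fields to tiers below are purely hypothetical and would have to be defined by the applicable legal and organizational policy.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Access tiers, ordered from least to most privileged."""
    BASIC = 1
    EXTENDED = 2
    FULL = 3


# Hypothetical clearance of each end-user category.
ROLE_TIER = {
    "hr_recruiter": Tier.BASIC,
    "corporate_auditor": Tier.EXTENDED,
    "law_enforcement": Tier.FULL,
}

# Hypothetical minimum tier required to see each field of a profile.
FIELD_TIER = {
    "professional_profile": Tier.BASIC,
    "public_posts": Tier.EXTENDED,
    "location_history": Tier.FULL,
}


def visible_fields(role, record):
    """Return only the fields of `record` that `role` is cleared to see.

    Unknown roles see nothing; unknown fields are treated as requiring
    the highest tier, so the filter fails closed.
    """
    clearance = ROLE_TIER.get(role)
    if clearance is None:
        return {}
    return {
        field: value
        for field, value in record.items()
        if FIELD_TIER.get(field, Tier.FULL) <= clearance
    }


# Hypothetical record produced by an OSINT tool.
example_record = {
    "professional_profile": "software engineer",
    "public_posts": ["post 1", "post 2"],
    "location_history": ["city A", "city B"],
}
```

Failing closed for unknown roles and unknown fields is a deliberate choice in this sketch: a field added later by a new data source stays hidden from lower tiers until it is explicitly classified.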

Finally, it is important to note that OSINT is enabling new
proposals to combat this scourge of cyber-aggressions [94].
In this sense, OSINT misuse is likely to be detectable
with OSINT-based tools themselves.


X. CONCLUSION AND FUTURE WORK 

The widespread use of forums, social networks, or the media, 
as well as the large amount of existing data, turn Open Source 
Intelligence (OSINT) into the next Internet goldmine. The 
extraction of knowledge from public sources represents a way 
of resolving existing problems from a different and innovative 
perspective. Specifically, cybersecurity and cyberdefence can
benefit greatly from the results that this type of
intelligence can offer. Therefore, automated OSINT processes
should be implemented, capable of taking investigations to
all parts of the Internet and extending our mind through the
web.

This paper described the status of OSINT today. It revealed
that the effectiveness of current works is questionable,
mainly due to their poor application in real scenarios. In
fact, there is a lack of serious approaches for transforming
OSINT into a robust and self-managed solution. Nevertheless,
we suggest the integration of OSINT into existing cyberdefence
mechanisms to move from the atomic technical trails of a cyber
incident to the profile of the culprit or the identity of the
suspect. The article also presented some OSINT techniques for
basic searches and described the most sophisticated OSINT
tools currently available for advanced investigations.
Depending on the data available and on the ultimate goal,
selecting the most appropriate tool makes the difference.
However, a varied combination of them is actually the key to
achieving plausible results.

In the context of Spain, we pointed out some indications
which might confirm that Spanish Law Enforcement Agencies
and Intelligence Services employ OSINT in their internal
procedures. Despite being a confidential aspect of their
functioning, OSINT is a crucial element in the context of
their investigations. It is worth pointing out that Spain
would be fertile ground in which to research, develop and
apply this methodology due to its Open Data maturity. In
fact, it is one of the most transparent countries in Europe,
according to the European Data Portal.

As future research directions, the article outlined some
open challenges related to gathering, analyzing and extracting
real knowledge from the immensity of the Internet. Aspects
such as misinformation, privacy, and legality will be
prominent in the future of OSINT. There is still a long way
to go in this area, and to that end the community should
address the discussed challenges by including advanced
techniques and improving the current performance. The
ultimate goal of OSINT is to be able to deliver the desired
finding for a certain purpose, in an automated and
self-driven way.